I know. hence my comments (check #38).
ps.: i do not think you see any edits i made past the initial post.
that is correct, I just realized that emails do not get sent when a post is edited, only when a post is added... I'll check my forum settings.
dell1 ~ # omping -c 600 -i 1 -q sys3-corosync sys5-corosync dell1-corosync
sys3-corosync : waiting for response msg
sys5-corosync : waiting for response msg
sys5-corosync : joined (S,G) = (*, 232.43.211.234), pinging
sys3-corosync : waiting for response msg
sys3-corosync : joined (S,G) = (*, 232.43.211.234), pinging
sys5-corosync : given amount of query messages was sent
sys3-corosync : given amount of query messages was sent
sys3-corosync : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.077/0.245/0.306/0.031
sys3-corosync : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.082/0.259/0.319/0.032
sys5-corosync : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.143/0.253/11.557/0.463
sys5-corosync : multicast, xmt/rcv/%loss = 600/599/0% (seq>=2 0%), min/avg/max/std-dev = 0.146/0.260/11.575/0.464
sys3 ~ # omping -c 600 -i 1 -q sys3-corosync sys5-corosync dell1-corosync
sys5-corosync : waiting for response msg
dell1-corosync : waiting for response msg
sys5-corosync : joined (S,G) = (*, 232.43.211.234), pinging
dell1-corosync : joined (S,G) = (*, 232.43.211.234), pinging
sys5-corosync : given amount of query messages was sent
dell1-corosync : given amount of query messages was sent
sys5-corosync : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.116/0.196/1.173/0.050
sys5-corosync : multicast, xmt/rcv/%loss = 600/599/0% (seq>=2 0%), min/avg/max/std-dev = 0.124/0.214/1.192/0.050
dell1-corosync : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.143/0.256/3.950/0.157
dell1-corosync : multicast, xmt/rcv/%loss = 600/599/0% (seq>=2 0%), min/avg/max/std-dev = 0.160/0.268/3.960/0.157
sys5 ~ # omping -c 600 -i 1 -q sys3-corosync sys5-corosync dell1-corosync
sys3-corosync : waiting for response msg
dell1-corosync : waiting for response msg
sys3-corosync : waiting for response msg
dell1-corosync : waiting for response msg
sys3-corosync : joined (S,G) = (*, 232.43.211.234), pinging
dell1-corosync : joined (S,G) = (*, 232.43.211.234), pinging
sys3-corosync : given amount of query messages was sent
dell1-corosync : given amount of query messages was sent
sys3-corosync : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.081/0.202/0.315/0.034
sys3-corosync : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.108/0.224/0.322/0.032
dell1-corosync : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.145/0.238/0.346/0.036
dell1-corosync : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.164/0.249/0.355/0.036
Both of the pings below were run from sys5.
ping -c 1000 -i 0.1 10.2.8.181
ping -c 1000 -i 0.1 10.2.8.42
--- 10.2.8.42 ping statistics ---
1000 packets transmitted, 1000 received, 0% packet loss, time 99900ms
rtt min/avg/max/mdev = 0.064/0.176/0.287/0.029 ms
sys5 ~
--- 10.2.8.181 ping statistics ---
1000 packets transmitted, 1000 received, 0% packet loss, time 99897ms
rtt min/avg/max/mdev = 0.117/0.194/0.304/0.031 ms
sys5 ~ #
ping -c 1000 -i 0.1 10.2.8.19
--- 10.2.8.19 ping statistics ---
1000 packets transmitted, 1000 received, 0% packet loss, time 99926ms
rtt min/avg/max/mdev = 0.101/0.185/0.263/0.035 ms
ping -c 1000 -i 0.1 10.2.8.42
--- 10.2.8.42 ping statistics ---
1000 packets transmitted, 1000 received, 0% packet loss, time 99923ms
rtt min/avg/max/mdev = 0.073/0.193/0.275/0.035 ms
This morning:
sys5 is all green,
dell1 and sys3 show only localhost green.
In the past sys3 was all green.
There was a backup on sys5 last night.
Via automatic (scheduled) backups, or by hand?
Basically replicate the backup procedure from your nightly scheduled backup 1:1. Then just run it a minute from now instead of at the scheduled time.
109: Nov 21 22:27:19 INFO: transferred 159995 MB in 1636 seconds (97 MB/s)
109: Nov 23 00:42:18 INFO: transferred 159995 MB in 9735 seconds (16 MB/s)
1747: Nov 21 22:46:12 INFO: transferred 39728 MB in 609 seconds (65 MB/s)
1747: Nov 23 03:46:37 INFO: transferred 39728 MB in 9308 seconds (4 MB/s)
3902: Nov 21 22:48:03 INFO: transferred 8589 MB in 20 seconds (429 MB/s)
3902: Nov 23 03:53:18 INFO: transferred 8589 MB in 81 seconds (106 MB/s)
Nov 22 22:00:02 sys5 vzdump[23130]: INFO: Starting Backup of VM 109 (qemu)
Nov 22 22:00:03 sys5 qm[23133]: <root@pam> update VM 109: -lock backup
Nov 22 22:02:20 sys5 corosync[8309]: [MAIN ] Corosync main process was not scheduled for 6762.6118 ms (threshold is 1320.0000 ms). Consider token timeout increase.
Nov 22 22:02:20 sys5 corosync[8309]: [TOTEM ] A processor failed, forming new configuration.
Nov 22 22:02:20 sys5 pve-firewall[8317]: firewall update time (5.092 seconds)
Nov 22 22:02:20 sys5 corosync[8309]: [TOTEM ] A new membership (10.2.8.19:14180) was formed. Members joined: 1 3 left: 1 3
Nov 22 22:02:20 sys5 corosync[8309]: [TOTEM ] Failed to receive the leave message. failed: 1 3
Nov 22 22:02:20 sys5 corosync[8309]: [QUORUM] Members[3]: 4 1 3
Nov 22 22:02:20 sys5 corosync[8309]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 22 22:02:26 sys5 pvestatd[23475]: status update time (21.081 seconds)
Nov 22 22:02:26 sys5 pmxcfs[16758]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/107: -1
Nov 22 22:02:26 sys5 pmxcfs[16758]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/5544: -1
Nov 22 22:02:26 sys5 pmxcfs[16758]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/3103: -1
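The "Consider token timeout increase" hint in that corosync log can be acted on in /etc/corosync/corosync.conf. A minimal sketch of the relevant fragment; the 10000 ms value is an assumption for illustration, not a recommendation from this thread, and the change has to be propagated to all nodes before restarting corosync:

```
totem {
  version: 2
  # token: total token timeout in milliseconds.
  # Raising it makes the cluster more tolerant of scheduling stalls
  # (like the 6762 ms one logged above) at the cost of slower
  # failure detection. Default-derived threshold here was 1320 ms.
  token: 10000
}
```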
vzdump --mailnotification always --node sys5 --storage nfs-pve --mode snapshot --exclude 77904,8102 --mailto fbcadmin --all 1 --quiet 1 --compress lzo
109: Nov 22 22:00:02 INFO: Starting Backup of VM 109 (qemu)
109: Nov 22 22:00:02 INFO: status = running
109: Nov 22 22:00:03 INFO: update VM 109: -lock backup
109: Nov 22 22:00:03 INFO: backup mode: snapshot
109: Nov 22 22:00:03 INFO: ionice priority: 7
109: Nov 22 22:00:03 INFO: creating archive '/mnt/pve/nfs-pve/dump/vzdump-qemu-109-2015_11_22-22_00_02.vma.lzo'
109: Nov 22 22:00:03 INFO: started backup task '3a6ca23d-a2de-4af9-8b26-b70d17228357'
109: Nov 22 22:00:06 INFO: status: 0% (231866368/159995920384), sparse 0% (5517312), duration 3, 77/75 MB/s
109: Nov 22 22:00:59 INFO: status: 1% (1604059136/159995920384), sparse 0% (26710016), duration 56, 25/25 MB/s
109: Nov 22 22:04:03 INFO: status: 2% (3265134592/159995920384), sparse 0% (47767552), duration 240, 9/8 MB/s
109: Nov 22 22:05:21 INFO: status: 3% (4838785024/159995920384), sparse 0% (237436928), duration 318, 20/17 MB/s
109: Nov 22 22:08:21 INFO: status: 4% (6504251392/159995920384), sparse 0% (270823424), duration 498, 9/9 MB/s
109: Nov 22 22:10:09 INFO: status: 5% (8088715264/159995920384), sparse 0% (317431808), duration 606, 14/14 MB/s
109: Nov 22 22:10:17 INFO: status: 6% (9725870080/159995920384), sparse 0% (1478684672), duration 614, 204/59 MB/s
109: Nov 22 22:10:22 INFO: status: 7% (11249254400/159995920384), sparse 1% (2859257856), duration 619, 304/28 MB/s
109: Nov 22 22:10:27 INFO: status: 8% (13103464448/159995920384), sparse 2% (4700565504), duration 624, 370/2 MB/s
109: Nov 22 22:10:32 INFO: status: 9% (14650376192/159995920384), sparse 3% (6091571200), duration 629, 309/31 MB/s
109: Nov 22 22:10:42 INFO: status: 10% (16145514496/159995920384), sparse 4% (7089979392), duration 639, 149/49 MB/s
109: Nov 22 22:10:57 INFO: status: 11% (17610440704/159995920384), sparse 5% (8027828224), duration 654, 97/35 MB/s
109: Nov 22 22:11:23 INFO: status: 12% (19201916928/159995920384), sparse 5% (8063328256), duration 680, 61/59 MB/s
109: Nov 22 22:16:09 INFO: status: 13% (20822360064/159995920384), sparse 5% (8098742272), duration 966, 5/5 MB/s
109: Nov 22 22:16:54 INFO: status: 14% (22427140096/159995920384), sparse 5% (8122384384), duration 1011, 35/35 MB/s
109: Nov 22 22:19:38 INFO: status: 15% (24077008896/159995920384), sparse 5% (8151613440), duration 1175, 10/9 MB/s
109: Nov 22 22:21:33 INFO: status: 16% (25710559232/159995920384), sparse 5% (8177446912), duration 1290, 14/13 MB/s
109: Nov 22 22:23:03 INFO: status: 17% (27205632000/159995920384), sparse 5% (8200376320), duration 1380, 16/16 MB/s
109: Nov 22 22:23:54 INFO: status: 18% (28811853824/159995920384), sparse 5% (8229777408), duration 1431, 31/30 MB/s
109: Nov 22 22:25:47 INFO: status: 19% (30447173632/159995920384), sparse 5% (8254017536), duration 1544, 14/14 MB/s
Nov 23 15:50:01 dell1 CRON[25215]: (root) CMD (pve-zsync sync --source 4526 --dest 10.2.2.46:tank/pve-zsync-bkup --name etherpad-syncjob --maxsnap 12 --method ssh)
Nov 23 15:50:01 dell1 CRON[25217]: (root) CMD (pve-zsync sync --source 3106 --dest 10.2.2.46:tank/pve-zsync-bkup --name mediawiki-syncjob --maxsnap 12 --method ssh)
Nov 23 15:50:01 dell1 CRON[25218]: (root) CMD (pve-zsync sync --source 3122 --dest 10.2.2.46:tank/pve-zsync-bkup --name ona-syncjob --maxsnap 12 --method ssh)
Nov 23 15:50:01 dell1 CRON[25216]: (root) CMD (pve-zsync sync --source 101 --dest 10.2.2.46:tank/pve-zsync-bkup --name ldap-syncjob --maxsnap 12 --method ssh)
Nov 23 15:50:01 dell1 CRON[25220]: (root) CMD (pve-zsync sync --source 3551 --dest 10.2.2.46:tank/pve-zsync-bkup --name nodejs-syncjob --maxsnap 12 --method ssh)
Nov 23 15:50:01 dell1 CRON[25219]: (root) CMD (pve-zsync sync --source 4501 --dest 10.2.2.46:tank/pve-zsync-bkup --name pro4-ray-syncjob --maxsnap 48 --method ssh)
Nov 23 15:50:38 dell1 pveproxy[27822]: worker exit
Nov 23 15:50:38 dell1 pveproxy[9058]: worker 27822 finished
Nov 23 15:50:38 dell1 pveproxy[9058]: starting 1 worker(s)
Nov 23 15:50:38 dell1 pveproxy[9058]: worker 26069 started
Nov 23 15:50:42 dell1 corosync[7713]: [TOTEM ] A processor failed, forming new configuration.
Nov 23 15:50:42 dell1 corosync[7713]: [TOTEM ] A new membership (10.2.8.19:15060) was formed. Members
$ModLoad ommail
$ActionMailSMTPServer localhost
$ActionMailFrom rsyslog@myplace.com
$ActionMailTo someone@myplace.com
$template mailSubject,"A processor failed line in syslog on %hostname%"
$template mailBody,"Check pve web pages, they may be red.\r\n\r\n%msg%"
$ActionMailSubject mailSubject
# Only send an email every 15 minutes
# $ActionExecOnlyOnceEveryInterval 900
# This if/then must all be on one line
if $msg contains 'A processor failed' then :ommail:;mailBody
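As a quick sanity check of the trigger string, independent of rsyslog itself, you can test whether a sample log line would satisfy the `contains 'A processor failed'` condition before wiring up mail delivery. A minimal shell sketch (the sample line is taken from the corosync logs earlier in the thread):

```shell
#!/bin/sh
# Sample corosync log line from this thread
line='corosync[7713]: [TOTEM ] A processor failed, forming new configuration.'

# Mirror the rsyslog condition: if $msg contains 'A processor failed'
case "$line" in
  *"A processor failed"*) result="match" ;;
  *)                      result="no match" ;;
esac

echo "$result: rsyslog rule would fire for this line"
```

To test end to end instead, `logger "A processor failed (test)"` from a shell will inject a matching line into syslog and should trigger the mail action.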
I wonder why the separate VLANs for every service, prioritised at the switch level, did not do the trick.
Any chance one of the services you mentioned above did not run in its own VLAN, and/or used a higher VLAN priority than the corosync VLAN?
Good thing it's fixed (took long enough).
Yeah, unless you have a NIC for every single VLAN (in your case that would be ...5??), that is rather hard to do (at some point you run out of NICs). It also feels tedious.
I currently have around 80 VLANs in my network, with up to 40 OVS IntPorts (in different VLANs) running over a single bond... think about that!
For a cluster, I do not think it is normal for one node to have /etc/pve writable and the others not.
Is that true?
No, this is strange.