HOWTO: Networking: Verify that LACP bonding is working in Linux

This information comes from an escalation sent to SES from SPS about how to verify, from the Linux OS perspective, that LACP bonding is working.

Thanks to Ryan Davies/SES and Alain/SUS.

 

 

Note that Alain (in the thread below) says the Partner Mac Address is null (00:00:00:00:00:00) when LACP isn’t working correctly. When this happens, the aggregator IDs of the slaves will also differ. It is also possible, and not all that uncommon, for the partner MAC address to be populated while the aggregator IDs still differ. This means that the switch is configured for LACP but the cables aren’t connected to the correct ports.
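The three situations described above can be told apart mechanically. The following is a minimal sketch, not an official tool: classify_bond is a hypothetical helper name, and it assumes the /proc/net/bonding layout shown in the outputs further down this page (one "Partner Mac Address" line in the Active Aggregator Info section and one "Aggregator ID" line per slave).

```shell
#!/bin/sh
# classify_bond FILE
# Hypothetical helper: prints a one-line verdict for an 802.3ad bond,
# given a file in /proc/net/bonding/bondX format.
classify_bond() {
    awk '
        # The partner MAC is reported under "Active Aggregator Info".
        /Partner Mac Address:/ { partner = $NF }
        # Only count Aggregator ID lines that belong to a slave section.
        /^Slave Interface:/ { in_slave = 1 }
        in_slave && $1 == "Aggregator" && $2 == "ID:" {
            if (!($NF in seen)) { seen[$NF] = 1; distinct++ }
        }
        END {
            if (partner == "" || distinct == 0)
                print "UNKNOWN: not an 802.3ad bond status file"
            else if (partner == "00:00:00:00:00:00")
                print "FAIL: partner MAC is null, LACP has not negotiated"
            else if (distinct > 1)
                print "FAIL: partner MAC is populated but aggregator IDs differ, check switch ports and cabling"
            else
                print "OK: partner MAC is populated and aggregator IDs match"
        }
    ' "$1"
}

# Usage (the bond name is an example):
#   classify_bond /proc/net/bonding/bond0
```

The function only reads the file it is given, so it can also be run against a copy of the bonding status collected from a customer system.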

 

More information ---

There are two types of LACP modes, active and passive:

·        Active - Always sends LACP data units (LACPDUs) on the wire.

·        Passive - Only responds to LACPDUs on the wire.

 

Our appliances, and I’m guessing that includes any Linux server, require that the switch LACP mode be set to active.

 

1.      When an SN appliance or DXi boots with LACP configured, it starts sending LACPDUs on the wire.

2.      When the switch sees that the appliance is up, it also starts sending LACPDUs on the wire.

3.      The aggregator is then negotiated, which is the whole purpose of LACP.

a.      It might help to think about LACP along the lines of DHCP. You can create a static aggregator, like a static IP, or you can use LACP and let it get created dynamically, like with DHCP.

 

It’s important to note that mode 0 (balance-rr, round-robin) is the option to use if a customer wants a static aggregator (the equivalent of a static IP). Unlike with DHCP, though, the dynamic option is the recommended one here: use LACP. In fact, the argument could be made to not even give customers the option of mode 0. Too many customers have been told by Quantum personnel over the years that mode 0 doesn’t require any configuration on the switch. Using mode 0 without configuring the switch is just like using LACP without configuring the switch. We even used to state that in our documentation.
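To see which of the two a bond is actually using, the Bonding Mode line in /proc/net/bonding/bondX can be read directly. A minimal sketch follows. The helper names are made up for illustration, and the round-robin match relies on the string the kernel bonding driver normally prints for mode 0 ("load balancing (round-robin)"), which does not appear elsewhere on this page.

```shell
#!/bin/sh
# bond_mode FILE
# Prints the text after "Bonding Mode: " from a bonding status file.
bond_mode() {
    sed -n 's/^Bonding Mode: //p' "$1"
}

# bond_switch_requirement FILE
# Hypothetical helper: maps the mode to the switch-side setup it expects.
bond_switch_requirement() {
    case "$(bond_mode "$1")" in
        *round-robin*) echo "mode 0: switch ports need a static aggregator" ;;
        *802.3ad*)     echo "mode 4 (LACP): switch ports need an LACP aggregator" ;;
        *)             echo "other mode: not covered by this sketch" ;;
    esac
}

# Usage (the bond name is an example):
#   bond_switch_requirement /proc/net/bonding/bond0
```

Either way the answer is the same as in the paragraph above: both modes need matching configuration on the switch.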

 

----------

Conclusion

1.      If the Partner Mac Address in /proc/net/bonding/bondX is populated AND the aggregator IDs are the same then LACP is working properly.

2.      If the condition in step 1 isn’t met, or mode 0 is in use and an aggregation mismatch is suspected, take down all but one of the ports in the bond and see whether the network problems cease.

a.      In my experience, a bond on a host or an aggregator on a switch with a single connection behaves no differently than a single connection without the bond or aggregator. Whenever you suspect an aggregation problem, you can just down all but one of the ports and test.

3.      The easiest way to take down ports in a bond is to use ifdown. Example for a bond with eth4 and eth5 as slaves:

a.      ifdown eth5

4.      If the aggregator IDs don’t match, I always ifdown the slaves whose aggregator ID DOESN’T match the active aggregator. I’ve never seen this interrupt any existing connections.
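Step 4 can be partly scripted. The sketch below is an illustration under assumptions, not a supported tool: mismatched_slaves is a hypothetical helper, and it assumes the Aggregator ID of the Active Aggregator Info section appears before the first Slave Interface section, as it does in the /proc/net/bonding outputs on this page. It only prints the slaves whose aggregator ID differs from the active one and leaves the ifdown itself to the operator.

```shell
#!/bin/sh
# mismatched_slaves FILE
# Prints the name of every slave whose Aggregator ID differs from the ID
# listed in the "Active Aggregator Info" section of FILE.
mismatched_slaves() {
    awk '
        /^Slave Interface:/ { slave = $NF }
        $1 == "Aggregator" && $2 == "ID:" {
            if (slave == "")
                active = $NF          # seen before any slave section
            else if ($NF != active)
                print slave
        }
    ' "$1"
}

# Usage (the bond name is an example). The loop only echoes the commands
# so they can be reviewed before anything is taken down:
#   for s in $(mismatched_slaves /proc/net/bonding/bond0); do
#       echo "ifdown $s"
#   done
```

For the mismatched example further down this page (active aggregator 1, eth2 on aggregator 1, eth3 on aggregator 2), it prints eth3.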

 

 

 

 

Ryan Davies | SES | Quantum Corporation | Mobile: +1 (208) 419-6548 | ryan.davies@quantum.com

 

From: Mamoon Ansari
Sent: Friday, May 12, 2017 8:24 AM
To: Jonathan McNerny; Alain Renaud; Jeff Syme
Cc: DL-SN-SES; DL-Service EMEA - SW; DL-KL ASPS Tech Support; DL-AMER-SPS; StorNext Sustaining Engineers; Joshua Martin
Subject: RE: Xcellis Workflow Director Escalation To SES From SPS: CASE0339896, Customer:CBS Corp -- how to verify that the LACP bond is OK in Linux vs. enet switch, Serial Number: AV1638CKH00220

 

Along with what Alain said, you also want to check the “Aggregator IDs” (per Ryan Davies). They should be the same for all slave interfaces.

 

Example: in the output below they are not the same. Houston, we have a problem:

 

Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)

802.3ad info
LACP rate: slow
Aggregator selection policy (ad_select): stable
Active Aggregator Info:
     Aggregator ID: 1
     Number of ports: 1
     Actor Key: 17
     Partner Key: 1
     Partner Mac Address: 00:00:00:00:00:00

Slave Interface: eth2
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: d4:ae:52:88:08:bd
Aggregator ID: 1
Slave queue ID: 0

Slave Interface: eth3
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: d4:ae:52:88:08:bf
Aggregator ID: 2
Slave queue ID: 0

 

Thanks,

========================================================================================================================================================

Mamoon R. Ansari | Technical Support Engineer, Software Product Support | Office: 630.640.1025 | mamoon.ansari@quantum.com | Quantum.com | Support: 800.827.3822


 

From: Jonathan McNerny
Sent: Friday, May 12, 2017 8:58 AM
To: Alain Renaud; Jeff Syme
Cc: DL-SN-SES; DL-Service EMEA - SW; DL-KL ASPS Tech Support; DL-AMER-SPS; StorNext Sustaining Engineers; Joshua Martin
Subject: RE: Xcellis Workflow Director Escalation To SES From SPS: CASE0339896, Customer:CBS Corp -- how to verify that the LACP bond is OK in Linux vs. enet switch, Serial Number: AV1638CKH00220

 

The other day I found that we have ‘iftop’ on both CentOS 6 and 7, which is cool because you can do lots of filtering, as with tcpdump, but iftop is centered on bandwidth monitoring.

 

So, if you know the NICs in the bond….

 

[root@cx-node1 ~]# ip addr show | grep master

5: eth2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP qlen 1000

6: eth3: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP qlen 1000

 

You can monitor each port in separate shells.

 

iftop -B -i eth2

iftop -B -i eth3

 

Then just pay attention to the stats reported at the bottom of the live, top-like output.

[iftop screenshots for #eth2, #eth3, and #client side were attached to the original email.]
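Where iftop is not available, a similar per-port comparison can be approximated from the kernel's byte counters. This is a rough sketch under assumptions: the helper names and the SYSFS_NET override are hypothetical, it relies on the standard /sys/class/net/<interface>/statistics/rx_bytes and tx_bytes files, and it reports an average over the sampling interval rather than a live view.

```shell
#!/bin/sh
# read_bytes IFACE rx|tx
# Prints the cumulative byte counter for one direction of one interface.
# SYSFS_NET is a hypothetical override so the function can be pointed at a
# copy of the tree; by default the real /sys/class/net is used.
read_bytes() {
    cat "${SYSFS_NET:-/sys/class/net}/$1/statistics/${2}_bytes"
}

# sample_rate IFACE SECONDS
# Prints "IFACE rx_bytes_per_sec tx_bytes_per_sec", averaged over SECONDS
# (a whole number greater than zero).
sample_rate() {
    rx0=$(read_bytes "$1" rx); tx0=$(read_bytes "$1" tx)
    sleep "$2"
    rx1=$(read_bytes "$1" rx); tx1=$(read_bytes "$1" tx)
    echo "$1 $(( (rx1 - rx0) / $2 )) $(( (tx1 - tx0) / $2 ))"
}

# Usage, one call per slave (interface names are examples):
#   sample_rate eth2 5
#   sample_rate eth3 5
```

As with iftop, the point is to run it against each slave while traffic is flowing and compare the numbers.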

 

Jon

 

 

Quantum

Jonathan McNerny - Software Product Support
720.249.3916 | Jonathan.McNerny@Quantum.com | Quantum.com

 

From: Alain Renaud
Sent: Thursday, May 11, 2017 4:34 PM
To: Jeff Syme
Cc: DL-SN-SES; DL-Service EMEA - SW; DL-KL ASPS Tech Support; DL-AMER-SPS; StorNext Sustaining Engineers; Joshua Martin
Subject: Re: Xcellis Workflow Director Escalation To SES From SPS: CASE0339896, Customer:CBS Corp -- how to verify that the LACP bond is OK in Linux vs. enet switch, Serial Number: AV1638CKH00220

 

Normally, when an LACP bond is set up correctly, you should see something like this on the Linux side. Note that when it is not configured properly, the Partner Mac Address is usually set to 00:00:00:00:00:00.

 

 

 

# more /proc/net/bonding/bond1
Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer3+4 (1)
MII Status: up
MII Polling Interval (ms): 500
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: slow
Aggregator selection policy (ad_select): stable
Active Aggregator Info:
     Aggregator ID: 1
     Number of ports: 2
     Actor Key: 33
     Partner Key: 510
     Partner Mac Address: 00:23:04:ee:be:02

Slave Interface: eth4
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: a0:36:9f:55:3a:38
Aggregator ID: 1
Slave queue ID: 0

Slave Interface: eth5
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: a0:36:9f:55:3a:3a
Aggregator ID: 1
Slave queue ID: 0

 

 

Another option is to look at the ifconfig output of both interfaces in the bond and confirm that they are somewhat balanced, or at least that neither one is close to zero.

 

# ifconfig eth4
eth4      Link encap:Ethernet  HWaddr A0:36:9F:30:A5:F0
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:280918667889 errors:0 dropped:105908 overruns:0 frame:0
          TX packets:198236148255 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:364072984209121 (331.1 TiB)  TX bytes:242784925987710 (220.8 TiB)

# ifconfig eth5
eth5      Link encap:Ethernet  HWaddr A0:36:9F:30:A5:F0
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:100424527061 errors:0 dropped:151830 overruns:0 frame:0
          TX packets:194613416278 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:111270973334344 (101.2 TiB)  TX bytes:235191060763130 (213.9 TiB)
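The same comparison can be done numerically. This is a minimal sketch, assuming a shell with 64-bit arithmetic (bash and dash on current distributions qualify); traffic_share is a hypothetical helper name. Fed the RX byte counts from the two ifconfig outputs above, it reports 76 and 23 percent, which is uneven but nowhere near the "one close to zero" pattern.

```shell
#!/bin/sh
# traffic_share BYTES_A BYTES_B
# Prints the integer percentage of the combined total carried by each
# interface. Counters large enough that BYTES * 100 exceeds the shell's
# integer range would overflow.
traffic_share() {
    total=$(( $1 + $2 ))
    if [ "$total" -eq 0 ]; then
        echo "0 0"
        return 0
    fi
    echo "$(( $1 * 100 / total )) $(( $2 * 100 / total ))"
}

# RX bytes for eth4 and eth5, copied from the ifconfig output:
traffic_share 364072984209121 111270973334344   # prints "76 23"
```

The percentages are rounded down, so they may add up to 99 rather than 100.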

 

 

On May 11, 2017, at 18:19, jeff.syme@quantum.com wrote:

 

Case ID               : 0339896
Company Name          : CBS Corp
StorNext Serial Number: AV1638CKH00220
Summary               : how to verify that the LACP bond is OK in Linux vs. enet switch
Escalated On:         : May 11, 2017 16:19:09 MDT

Severity              : Minor
Customer Temp         : Normal
Site Status           : Normal
New Install           : Yes
Has it worked before  : No
Secure site           : Yes
HA System             : Yes
Medicus Site          : No

Suspected Bugs        : None
Logs Location         : http://susrepo/ticketinfo/SR03xxxxx/SR033xxxx/SR0339xxx/SR0339896
Your Email Address    : jeff.syme@quantum.com

Contract Coverage     : CRU/FRU GOLD Std Service Model
Assign to SES Person  : next available

----------------------
Installation Details

  StorNext Platform     : Xcellis Workflow Director
  StorNext OS           : CentOS-RH7.2 equivalent (3.10.0.327.4.5.EL7)
  StorNext Version      : 5.3.1
  StorNext Build        : 00000

  Client Platform       :
  Client OS             :
  Client Version        :
  Client Build          :
----------------------

Problem is on         : MDC
Date/Time Occurred    : ongoing

Description           :
Isn’t there some way to verify that the bond is OK in Linux and that it’s happy with the config from the switch?

I know that about 6 months ago there was an issue with some M440s that I deployed, which were complaining about the LACP config because it was correct on the MDCs but not on the switches.

According to my Cisco switches, the LACP bond is functional; I just want to be absolutely sure that the Xcellis nodes are happy with the network config.

Mitch Spacone
CBS Television City | Media Maintenance
7800 Beverly Blvd, Los Angeles, CA 90036
Mitch.spacone@cbs.com
Desk: 323.575.4141
Mobile: 310.347.7771

What has been done and results:
Ron Housman and I wrestled with a # vip_control being bound to the wrong bond in case 339895 | CBS | XCELLIS / HA_VIP not functional <now resolved>

Mitch is asking for two different serial numbers: an Xcellis and a M440

Questions             :
how to verify that the LACP bond is OK in Linux vs. enet switch

----------------------
Configuration Changes

  Any recent changes?   : Yes
  Stornext              : new install, left over from PS engineer
  System SW             : No changes
  System HW             : No changes
  Network               : No changes
  RAID                  : No changes
----------------------

This email was generated by: http://denlgservicev1.quantum.com/cgi-bin/escalations/index.cgi

 


Alain Renaud | Senior Sustaining Engineer | Quantum | VOIP: 612-567-4680 | Alain.Renaud@quantum.com

quantum.com/BeCertain

 

 


