NAS-Related Performance Issues

Overview

CT: Imagine a situation in which the customer is complaining about slow backup speeds between the media server and the DXi. In addition, the customer says that their optimized duplication jobs are failing.

 

Unfortunately, in order to resolve NAS-related performance issues we have to take on the role of the customer's network administrator.  Some customers have great network admins and others are not as fortunate.  It is the latter who present the greatest challenge for our service teams when determining the cause of "poor performance" in their backup solution.

Some professionals make a living doing nothing but troubleshooting networks, and troubleshooting NAS-related performance issues from a Quantum service perspective is one of the most challenging things we encounter. The purpose of this performance troubleshooting page is to help us quickly identify bottlenecks without having to become experts in all things networking.

 

CT: Do you want to include something like this for team members to use or will there be too much variance to provide any sort of template? You can use the template (MAKE THIS A DOWNLOADABLE DOC) to help you organize the output from your analysis.

 

Remember, you can use the Export as PDF option in the left-hand pane if you want to save a copy to your desktop for use at the customer site.

 


A Little Background

Network administrators use something called the OSI model to help them understand network issues and identify bottlenecks. From a Quantum perspective we're really only interested in layers 1-3.  If you want to fully understand all of the layers in the OSI model, read more about it on the internet.
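For this guide, the three layers of interest are:

  1. Layer 1 (Physical): cables, NICs, transceivers, and switch ports.
  2. Layer 2 (Data Link): Ethernet frames and MAC addresses, including link aggregation.
  3. Layer 3 (Network): IP addressing and routing.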


Start at Layer 1: The Physical Layer

First, verify the physical layer.  This can be done using the following steps:

  1. Remove any Windows teaming that is involved.  (The customer will have to delete the team and then configure a single interface with the desired IP info.)
  2. Disable ports or remove cables so that only a single interface is connected to the backup server.
  3. Connect only a single cable to the DXi.
  4. Zero the counters on the switch for the respective ports. (Hopefully the customer will know how to do this for their own switch.  If not, search the internet for a product manual to understand how to zero counters.)
  5. Enable netperf from the GUI.
  6. Copy the attached Windows version of netperf to the backup server.
  7. Run netperf to test throughput between the backup server and the DXi (see the example command after this list). Note that the -l 30 option means the test will run for 30 seconds; 30 can be changed to any value.
  8. The output from the netperf client should be 80-120 MB/s.  The rate can vary depending on what else is happening on the network.
  9. The switch ports that were zeroed out should not show any errors from the file transfer.
  10. If the output is below 80 MB/s or there are errors on the switch ports, use the process of elimination to determine which piece of hardware is bad.  Do this by simply trying different ports, cables, and NICs.
  11. Repeat these tests for any ports and cables to be used for aggregation.
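The following is a minimal sketch of steps 4, 7, and 9, assuming a Cisco switch and the Windows netperf client in the current directory; the switch port name and DXi IP address below are placeholders. By default netperf runs a TCP_STREAM test.

  # On the switch, steps 4 and 9 (Cisco IOS syntax; other vendors differ):
  clear counters GigabitEthernet0/1
  show interfaces GigabitEthernet0/1 | include errors

  # On the backup server, step 7: -H is the DXi IP, -l 30 runs the test
  # for 30 seconds, -f M reports throughput in MB/s
  netperf -H 10.20.218.10 -l 30 -f M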

Continue with Layer 2 and Layer 3

For layers 2 and 3, look at aggregation.  Testing layer 1 actually involved typical layer 2 and layer 3 (MAC and IP) communication; in the customer's network, we only care about these layers when it comes to aggregation.

 

Cisco refers to aggregation as EtherChannel technology. Reading about EtherChannel from Cisco will provide an understanding of most aggregation solutions from other vendors as well.  For the purposes of this document and solutions with a DXi system, keep these things in mind:
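Chief among them: EtherChannel and similar schemes balance traffic per flow, so a single backup stream will only ever use one member link of an aggregate; aggregation increases throughput for multiple concurrent streams, not for a single stream. Assuming a Cisco switch, the channel state and the hashing method can be checked with the following (a minimal sketch; other vendors have equivalent commands):

  show etherchannel summary        # port-channel status and member ports
  show etherchannel load-balance   # hashing method (src-mac, src-dst-ip, etc.)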


Troubleshooting

CT: Should these be listed as steps? In other words, is there an order here?

 

 


Common Concerns

The 80-120 MB/s range

There will always be customers who think 80 MB/s is not adequate on a single 1 GbE connection, but 85 MB/s is about our average when it comes to Windows.  Even when going direct, a lot of factors play into this, and registry settings need to be changed for the connection to be optimized.  In my experience I've always been able to find a system somewhere on the network that can show higher throughput to the DXi, proving that the issue is local to the backup server.
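As a hedged illustration of the Windows tuning involved: on Windows Server 2008 and later, the global TCP settings (including the receive window autotuning level) can be inspected on the media server with the command below. On older Windows versions the equivalent tuning lives in registry values such as TcpWindowSize under HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters.

  netsh interface tcp show global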

Customers will sometimes claim that DART is inaccurate

I have yet to see DART Ethernet statistics be inaccurate.  With one customer I had to capture packets on the DXi using tcpdump and then open the capture file in Wireshark.  The Wireshark numbers were identical to the DART numbers.

 


Exit Strategy

It is easy to get caught up in a NAS-related performance issue for weeks or months with a customer who usually thinks there is no problem with their network.  Our focus in service should always be helping the customer identify the problem and then handing off to them if we don't have the resources to resolve it.

 

Using the processes above should make it easier to identify where the problem is. Showing adequate throughput between the backup server(s) and the DXi proves that the bottleneck is most likely in the network between the backup clients and the media server, or in the configuration of the backup software.

 

If the customer does not agree with these findings, even though adequate performance between the DXi and the backup server has been proven, communicate this to management and an exit strategy will be discussed.


Conclusion

Using the troubleshooting methods in this guide will hopefully help us all save time when faced with challenging NAS performance issues. This guide lives on qwikipedia so that all service members can contribute to it. There is still a lot to be added, such as Linux-specific testing, CIFS vs. NFS, and more educational links.  For now, please send any ideas, comments, or suggestions to me at ryan.davies@quantum.com.

 


 

Notes

Per Dale Britton's request, we need to consider adding something about switch configuration (recommended parameters) and suggested troubleshooting steps.

Note by Charlotte Taylor on 03/30/2011 12:33 PM

Ryan,

 

I documented the usage of some generic networking commands that I thought might be useful in guiding some troubleshooting steps in the Qwiki article. I confess I may have pulled these from another source (maybe DXi Wiki), but I am not sure if I could give proper credit if we need to do that.  I also don't know if they are complete, but it's a start and I wanted to suggest them.  Please feel free to add at your discretion if they are useful. I envision them sitting in maybe a "Usage of Commonly Used Networking Commands" section somewhere between the Layer 1 and the Layer 2-3 sections.

 

 

##### DXi network port requirements #####

BAllen Note: ASPS typically requests customers to verify the following ports are open, but I can't say if the list is complete:

port 80    # TCP,UDP Hypertext Transfer Protocol (HTTP)
port 443   # TCP,UDP Hypertext Transfer Protocol over TLS/SSL (HTTPS)
port 21    # TCP File Transfer Protocol (FTP)
port 22    # TCP,UDP Secure Shell (SSH) - used for secure logins, file transfers (scp, sftp) and port forwarding
port 23    # TCP Telnet protocol - unencrypted text communications

            BAllen Note: This one might only be opened as needed. I think it is typically closed.
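A hedged way to spot-check these ports from another host on the network (assuming a Linux host with netcat installed; the IP below is a placeholder):

nc -zv 10.20.218.10 443   # -z scans without sending data, -v prints the result
nc -zv 10.20.218.10 22

From a Windows host, telnet <DXiIPAddress> <port> serves the same purpose for TCP ports.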

#### netstat ####


The netstat command can be used to confirm which ports are listening or have ESTABLISHED connections:

 

netstat -anp | grep :80 | more
netstat -anp | grep :443 | more
netstat -anp | grep :22 | more

Example output:

[root@ats10 ~]# netstat -anp | grep :22 | more
tcp        0      0 :::22                       :::*                        LISTEN      15535/sshd
tcp        0      0 ::ffff:10.20.218.10:22      ::ffff:10.20.218.94:2158    ESTABLISHED 6797/sshd: cliadmin
tcp        0      0 ::ffff:10.20.218.10:22      ::ffff:10.20.128.103:1495   ESTABLISHED 12048/2
tcp        0      0 ::ffff:10.20.218.10:22      ::ffff:10.20.218.77:1184    ESTABLISHED 21282/sshd: cliadmi

#### ifconfig ####

 

Use ifconfig to verify Ethernet port configuration and status.

Example output:

[root@ats10 ~]# ifconfig
bond0     Link encap:Ethernet  HWaddr 00:30:48:34:52:78
          inet addr:127.0.0.1  Bcast:127.0.0.255  Mask:255.255.255.0
          inet6 addr: fe80::230:48ff:fe34:5278/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:50757395 errors:0 dropped:0 overruns:0 frame:0
          TX packets:30519835 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:43096660280 (40.1 GiB)  TX bytes:8579879696 (7.9 GiB)

bond0:2   Link encap:Ethernet  HWaddr 00:30:48:34:52:78
          inet addr:10.20.218.10  Bcast:10.20.219.255  Mask:255.255.252.0
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1

eth0      Link encap:Ethernet  HWaddr 00:30:48:34:52:78
          inet6 addr: fe80::230:48ff:fe34:5278/64 Scope:Link
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:50756117 errors:0 dropped:0 overruns:0 frame:0
          TX packets:30519151 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100
          RX bytes:43096546607 (40.1 GiB)  TX bytes:8579399611 (7.9 GiB)
          Base address:0x4000 Memory:d8000000-d8020000

eth1      Link encap:Ethernet  HWaddr 00:30:48:34:52:78
          inet6 addr: fe80::230:48ff:fe34:5278/64 Scope:Link
          UP BROADCAST SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:1278 errors:0 dropped:0 overruns:0 frame:0
          TX packets:684 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:113673 (111.0 KiB)  TX bytes:480085 (468.8 KiB)
          Base address:0x4020 Memory:d8020000-d8040000

eth2      Link encap:Ethernet  HWaddr 00:30:48:9C:DC:79
          inet addr:10.17.21.1  Bcast:10.17.21.255  Mask:255.255.255.0
          inet6 addr: fe80::230:48ff:fe9c:dc79/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:21531079 errors:0 dropped:0 overruns:0 frame:0
          TX packets:21242841 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:4684304016 (4.3 GiB)  TX bytes:1535090617 (1.4 GiB)
          Base address:0x7400 Memory:d8600000-d8620000

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:148903508 errors:0 dropped:0 overruns:0 frame:0
          TX packets:148903508 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:24996461183 (23.2 GiB)  TX bytes:24996461183 (23.2 GiB)
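
When scanning output like this, the fields that matter most for performance are the errors/dropped/overruns/collisions counters, the RUNNING flag, and the MTU. A quick filter (a minimal sketch) to pull just the interface names and their error counters:

[root@ats10 ~]# ifconfig | grep -E 'Link encap|errors'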

##### traceroute #####

Use traceroute to verify the path between source and target IPs.

Example output:

From the Source DXi

traceroute -s <sourceDxiIPAddress> <TgtDXiIPAddress>

traceroute -s 10.20.218.10 10.20.218.20
traceroute to 10.20.218.20 (10.20.218.20) from 10.20.218.10, 30 hops max, 46 byte packets
 1  10.20.218.20 (10.20.218.20)  0.259 ms  0.384 ms  0.474 ms

From the Target DXi

traceroute -s <TgtDXiIPAddress> <sourceDxiIPAddress>

Areas of interest:

The traceroute should complete within a reasonable timeframe with no indications of delay.
If a traceroute fails, you will see it try 30 hops (the default) and output similar to the below:

[root@ats10 ~]# traceroute -s 10.20.218.10 10.20.200.20
traceroute to 10.20.200.20 (10.20.200.20) from 10.20.218.10, 30 hops max, 46 byte packets
 1  10.20.216.1 (10.20.216.1)  0.331 ms  0.319 ms  0.224 ms
 2  * * *
 3  * * *
 4  * * *
 5  * * *
..
..
30  * * *

Recreate the failure going both ways, using traceroute between the source and target, to get a sense of where the traceroute stops. This should help isolate which router on the network is blocking the trace.


#### ping ####

Use ping as a high-level verification that two hosts can communicate across a network:

[root@ats10 ~]# ping 10.20.218.20
PING 10.20.218.20 (10.20.218.20) 56(84) bytes of data.
64 bytes from 10.20.218.20: icmp_seq=0 ttl=64 time=0.196 ms
64 bytes from 10.20.218.20: icmp_seq=1 ttl=64 time=0.404 ms
64 bytes from 10.20.218.20: icmp_seq=2 ttl=64 time=0.366 ms
64 bytes from 10.20.218.20: icmp_seq=3 ttl=64 time=0.327 ms

--- 10.20.218.20 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3000ms
rtt min/avg/max/mdev = 0.196/0.323/0.404/0.079 ms, pipe 2

Areas of interest are:

time values: should be low
icmp_seq: ideally should be in order
packet loss: should be 0%

A network bottleneck can be suspected if the time values look delayed or the icmp_seq values are out of sequence.
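
Since backup traffic relies on full-size frames, it can also be worth confirming that the path passes the configured MTU without fragmenting. A hedged Linux example (1472 bytes of ICMP payload plus 28 bytes of headers equals the standard 1500-byte MTU; -M do prohibits fragmentation):

ping -s 1472 -M do 10.20.218.20

If this fails while a plain ping succeeds, something in the path has a smaller MTU than the endpoints expect.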

 

Note by Bill Allen on 03/25/2011 02:11 PM
Attachments
Title: Netperf for Windows
Last Updated: 02/25/2011 02:18 PM
Updated By: Tom McFaul

