NAS-Related Performance Issues

Overview

CT: Imagine a situation in which the customer is complaining about slow backup speeds between the media server and the DXi. In addition, the customer says that their optimized duplication jobs are failing.

 

Unfortunately, in order to resolve NAS-related performance issues we have to take on the role of the customer's network administrator.  Some customers have great network admins and others are not as fortunate.  It is the latter who present the greatest challenge for our service teams when determining the cause of "poor performance" in their backup solution.

Some professionals make a living doing nothing but troubleshooting networks, and troubleshooting NAS-related performance issues from a Quantum service perspective is one of the most challenging things we encounter. The purpose of this performance troubleshooting page is to help us quickly identify bottlenecks without having to become experts in all things networking.

 

CT: Do you want to include something like this for team members to use or will there be too much variance to provide any sort of template? You can use the template (MAKE THIS A DOWNLOADABLE DOC) to help you organize the output from your analysis.

 

Remember, you can use the Export as PDF option in the left-hand pane if you want to save a copy to your desktop for use at the customer site.

 


A Little Background

Network administrators use something called the OSI model to help them understand network issues and identify bottlenecks. From a Quantum perspective we're really only interested in layers 1-3.  If you want to fully understand all of the layers in the OSI model, read more about it on the internet.
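For this guide, the three layers of interest are:

  1. Layer 1 (Physical): cables, NICs, transceivers, and switch ports.
  2. Layer 2 (Data Link): Ethernet frames and MAC addresses, including link aggregation.
  3. Layer 3 (Network): IP addressing and routing.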


Start at Layer 1: The Physical Layer

First, verify the physical layer.  This can be done using the following steps:

  1. Remove any Windows teaming that is involved.  (The customer will have to delete the team and then configure a single interface with the desired IP info.)
  2. Disable ports or remove cables so that only a single interface is connected to the backup server.
  3. Connect only a single cable to the DXi.
  4. Zero the counters on the switch for the respective ports. (Hopefully the customer will know how to do this for their own switch.  If not, search the internet for a product manual to understand how to zero counters.)
  5. Enable netperf from the GUI.
  6. Copy the attached Windows version of netperf to the backup server.
  7. Run netperf to test throughput between the backup server and the DXi (see the example command after this list). Note that the -l 30 option means the test will run for 30 seconds; 30 can be changed to any value.
  8. The output from the netperf client should be 80-120 MB/s.  The rate can vary depending on what else is happening on the network.
  9. The switch ports that were zeroed out should not show any errors from the file transfer.
  10. If the output is below 80 MB/s or there are errors on the switch ports, use the process of elimination to determine which piece of hardware is bad.  Do this by simply trying different ports, cables, and NICs.
  11. Repeat these tests for any ports and cables to be used for aggregation.
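The following is a minimal sketch of steps 4, 7, and 9, assuming a Cisco switch and the Windows netperf client in the current directory; the switch port name and DXi IP address below are placeholders. By default netperf runs a TCP_STREAM test.

  # On the switch, steps 4 and 9 (Cisco IOS syntax; other vendors differ):
  clear counters GigabitEthernet0/1
  show interfaces GigabitEthernet0/1 | include errors

  # On the backup server, step 7: -H is the DXi IP, -l 30 runs the test
  # for 30 seconds, -f M reports throughput in MB/s
  netperf -H 10.20.218.10 -l 30 -f M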

Continue with Layer 2 and Layer 3

For layers 2 and 3, look at aggregation.  Testing layer 1 actually involved typical layer 2 and layer 3 (MAC and IP) communication; in the customer's network, we only care about these layers when it comes to aggregation.

 

Cisco refers to aggregation as EtherChannel technology. Reading about EtherChannel from Cisco will provide an understanding of most aggregation solutions from other vendors as well.  For the purposes of this document and solutions with a DXi system, keep these things in mind:
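Chief among them: EtherChannel and similar schemes balance traffic per flow, so a single backup stream will only ever use one member link of an aggregate; aggregation increases throughput for multiple concurrent streams, not for a single stream. Assuming a Cisco switch, the channel state and the hashing method can be checked with the following (a minimal sketch; other vendors have equivalent commands):

  show etherchannel summary        # port-channel status and member ports
  show etherchannel load-balance   # hashing method (src-mac, src-dst-ip, etc.)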


Troubleshooting

CT: Should these be listed as steps? In other words, is there an order here?

 

 


Common Concerns

The 80-120 MB/s range

There will always be customers who think 80 MB/s is not adequate on a single 1 GbE connection, but 85 MB/s is about our average when it comes to Windows.  Even when going direct, a lot of factors play into this, and registry settings need to be changed for the connection to be optimized.  In my experience I've always been able to find a system somewhere on the network that can show higher throughput to the DXi, proving that the issue is local to the backup server.
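As a hedged illustration of the Windows tuning involved: on Windows Server 2008 and later, the global TCP settings (including the receive window autotuning level) can be inspected on the media server with the command below. On older Windows versions the equivalent tuning lives in registry values such as TcpWindowSize under HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters.

  netsh interface tcp show global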

Customers will sometimes claim that DART is inaccurate

I have yet to see DART Ethernet statistics be inaccurate.  With one customer I had to capture packets on the DXi using tcpdump and then open the capture file in Wireshark.  The Wireshark numbers were identical to the DART numbers.

 


Exit Strategy

It is easy to get caught up in a NAS-related performance issue for weeks or months with a customer who usually thinks there is no problem with their network.  Our focus in service should always be helping the customer identify the problem and then handing off to them if we don't have the resources to resolve it.

 

Using the processes above should make it easier to identify where the problem is. Showing adequate throughput between the backup server(s) and the DXi proves that the bottleneck is most likely in the network between the backup clients and the media server, or in the configuration of the backup software.

 

If the customer does not agree with these findings, even though adequate performance between the DXi and the backup server has been proven, communicate this to management and an exit strategy will be discussed.


Conclusion

Using the troubleshooting methods in this guide will hopefully help us all save time when faced with challenging NAS performance issues. This guide lives on qwikipedia so that all service members can contribute to it. There is still a lot to be added, such as Linux-specific testing, CIFS vs. NFS, and more educational links.  For now, please send any ideas, comments, or suggestions to me at ryan.davies@quantum.com.

 


 

Notes

Per Dale Britton's request, we need to consider adding something about switch configuration (recommended parameters) and suggested troubleshooting steps.

Note by Charlotte Taylor on 03/30/2011 12:33 PM

Ryan,

 

I documented the usage of some generic networking commands that I thought might be useful in guiding some troubleshooting steps in the Qwiki article. I confess I may have pulled these from another source (maybe DXi Wiki), but I am not sure if I could give proper credit if we need to do that.  I also don't know if they are complete, but it's a start and I wanted to suggest them.  Please feel free to add at your discretion if they are useful. I envision them sitting in maybe a "Usage of Commonly Used Networking Commands" section somewhere between the Layer 1 and the Layer 2-3 sections.

 

 

##### DXi network port requirements #####

BAllen Note: ASPS typically requests customers to verify the following ports are open, but I can't say if the list is complete:

port 80    # TCP,UDP Hypertext Transfer Protocol (HTTP)
port 443   # TCP,UDP Hypertext Transfer Protocol over TLS/SSL (HTTPS)
port 21    # TCP File Transfer Protocol (FTP)
port 22    # TCP,UDP Secure Shell (SSH) - used for secure logins, file transfers (scp, sftp) and port forwarding
port 23    # TCP Telnet protocol - unencrypted text communications

            BAllen Note: This one might only be opened as needed. I think it is typically closed.
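A hedged way to spot-check these ports from another host on the network (assuming a Linux host with netcat installed; the IP below is a placeholder):

nc -zv 10.20.218.10 443   # -z scans without sending data, -v prints the result
nc -zv 10.20.218.10 22

From a Windows host, telnet <DXiIPAddress> <port> serves the same purpose for TCP ports.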

#### netstat ####


The netstat command can be used to confirm which ports are listening or have ESTABLISHED connections:

 

netstat -anp | grep :80 | more
netstat -anp | grep :443 | more
netstat -anp | grep :22 | more

Example output:

[root@ats10 ~]# netstat -anp | grep :22 | more
tcp        0      0 :::22                       :::*                        LISTEN      15535/sshd
tcp        0      0 ::ffff:10.20.218.10:22      ::ffff:10.20.218.94:2158    ESTABLISHED 6797/sshd: cliadmin
tcp        0      0 ::ffff:10.20.218.10:22      ::ffff:10.20.128.103:1495   ESTABLISHED 12048/2
tcp        0      0 ::ffff:10.20.218.10:22      ::ffff:10.20.218.77:1184    ESTABLISHED 21282/sshd: cliadmi

#### ifconfig ####

 

Use ifconfig to verify Ethernet port configuration and status.

Example output:

[root@ats10 ~]# ifconfig
bond0     Link encap:Ethernet  HWaddr 00:30:48:34:52:78
          inet addr:127.0.0.1  Bcast:127.0.0.255  Mask:255.255.255.0
          inet6 addr: fe80::230:48ff:fe34:5278/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:50757395 errors:0 dropped:0 overruns:0 frame:0
          TX packets:30519835 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:43096660280 (40.1 GiB)  TX bytes:8579879696 (7.9 GiB)

bond0:2   Link encap:Ethernet  HWaddr 00:30:48:34:52:78
          inet addr:10.20.218.10  Bcast:10.20.219.255  Mask:255.255.252.0
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1

eth0      Link encap:Ethernet  HWaddr 00:30:48:34:52:78
          inet6 addr: fe80::230:48ff:fe34:5278/64 Scope:Link
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:50756117 errors:0 dropped:0 overruns:0 frame:0
          TX packets:30519151 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100
          RX bytes:43096546607 (40.1 GiB)  TX bytes:8579399611 (7.9 GiB)
          Base address:0x4000 Memory:d8000000-d8020000

eth1      Link encap:Ethernet  HWaddr 00:30:48:34:52:78
          inet6 addr: fe80::230:48ff:fe34:5278/64 Scope:Link
          UP BROADCAST SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:1278 errors:0 dropped:0 overruns:0 frame:0
          TX packets:684 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:113673 (111.0 KiB)  TX bytes:480085 (468.8 KiB)
          Base address:0x4020 Memory:d8020000-d8040000

eth2      Link encap:Ethernet  HWaddr 00:30:48:9C:DC:79
          inet addr:10.17.21.1  Bcast:10.17.21.255  Mask:255.255.255.0
          inet6 addr: fe80::230:48ff:fe9c:dc79/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:21531079 errors:0 dropped:0 overruns:0 frame:0
          TX packets:21242841 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:4684304016 (4.3 GiB)  TX bytes:1535090617 (1.4 GiB)
          Base address:0x7400 Memory:d8600000-d8620000

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:148903508 errors:0 dropped:0 overruns:0 frame:0
          TX packets:148903508 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:24996461183 (23.2 GiB)  TX bytes:24996461183 (23.2 GiB)
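
When scanning output like this, the fields that matter most for performance are the errors/dropped/overruns/collisions counters, the RUNNING flag, and the MTU. A quick filter (a minimal sketch) to pull just the interface names and their error counters:

[root@ats10 ~]# ifconfig | grep -E 'Link encap|errors'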

##### traceroute #####

Use traceroute to verify the path between source and target IPs.

Example output:

From the Source DXi

traceroute -s <sourceDxiIPAddress> <TgtDXiIPAddress>

traceroute -s 10.20.218.10 10.20.218.20
traceroute to 10.20.218.20 (10.20.218.20) from 10.20.218.10, 30 hops max, 46 byte packets
 1  10.20.218.20 (10.20.218.20)  0.259 ms  0.384 ms  0.474 ms

From the Target DXi

traceroute -s <TgtDXiIPAddress> <sourceDxiIPAddress>

Areas of interest:

The traceroute should complete within a reasonable timeframe with no indications of delay.
If a traceroute fails, you will see it try 30 hops (the default) and output similar to the below:

[root@ats10 ~]# traceroute -s 10.20.218.10 10.20.200.20
traceroute to 10.20.200.20 (10.20.200.20) from 10.20.218.10, 30 hops max, 46 byte packets
 1  10.20.216.1 (10.20.216.1)  0.331 ms  0.319 ms  0.224 ms
 2  * * *
 3  * * *
 4  * * *
 5  * * *
..
..
30  * * *

Recreate the failure going both ways, using traceroute between the source and target, to get a sense of where the traceroute stops. This should help isolate which router on the network is blocking the trace.


#### ping ####

Use ping as a high-level verification that two hosts can communicate across a network:

[root@ats10 ~]# ping 10.20.218.20
PING 10.20.218.20 (10.20.218.20) 56(84) bytes of data.
64 bytes from 10.20.218.20: icmp_seq=0 ttl=64 time=0.196 ms
64 bytes from 10.20.218.20: icmp_seq=1 ttl=64 time=0.404 ms
64 bytes from 10.20.218.20: icmp_seq=2 ttl=64 time=0.366 ms
64 bytes from 10.20.218.20: icmp_seq=3 ttl=64 time=0.327 ms

--- 10.20.218.20 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3000ms
rtt min/avg/max/mdev = 0.196/0.323/0.404/0.079 ms, pipe 2

Areas of interest are:

time values: should be low
icmp_seq: ideally should be in order
packet loss: should be 0%

A network bottleneck can be suspected if the time values look delayed or the icmp_seq values are out of sequence.
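
Since backup traffic relies on full-size frames, it can also be worth confirming that the path passes the configured MTU without fragmenting. A hedged Linux example (1472 bytes of ICMP payload plus 28 bytes of headers equals the standard 1500-byte MTU; -M do prohibits fragmentation):

ping -s 1472 -M do 10.20.218.20

If this fails while a plain ping succeeds, something in the path has a smaller MTU than the endpoints expect.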

 

Note by Bill Allen on 03/25/2011 02:11 PM
Attachments
Title: Netperf for Windows
Last Updated: 02/25/2011 02:18 PM
Updated By: Tom McFaul

