The Metadata Controller System
The CPU power and memory capacity of the MDC System are important performance factors, as is the number of file systems hosted per system. To ensure fast response time, use dedicated systems, limit the number of file systems hosted per system (maximum of eight), and provide adequate CPU power and memory. Refer to the StorNext 5 User’s Guide for limits on the number of files per file system and per database instance.
Some metadata operations such as file creation can be CPU intensive, and benefit from increased CPU power.
Other operations, such as directory traversal, can benefit greatly from increased memory. SNFS provides two configuration file settings that can be used to realize performance gains from increased memory: BufferCacheSize and InodeCacheSize.
However, it is critical that the MDC system have enough physical memory available to ensure that the FSM process doesn’t get swapped out. Otherwise, severe performance degradation and system instability can result.
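On a Linux MDC, a quick spot check can confirm that the FSM has enough physical memory and is not being swapped. The following is a minimal sketch using standard Linux tools; the process name fsm and the sampling interval are illustrative.
# Show resident (RSS) and virtual memory for the FSM process(es)
ps -eo pid,rss,vsz,comm | grep '[f]sm'
# Watch swap activity; the si/so columns should remain at or near zero
vmstat 5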
The operating system on the metadata controller must always be run in U.S. English. On Windows systems, this is done by setting the system locale to U.S. English.
Caution: Because the File System Manager (FSM) supports over 1000 clients (with more than 1000 file requests per client), the resource limits of your MDC may be exhausted with additional load from other processes. Exceeding the file descriptor limit will cause errors on your system. Quantum recommends that you not run additional applications on the MDC.
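As a minimal sketch for a Linux MDC, the commands below check the open-file limit and the current descriptor usage of the FSM process; the pgrep pattern is illustrative.
# Inspect the open-file limit and current descriptor usage of one FSM process
FSM_PID=$(pgrep -o -x fsm)
grep 'Max open files' /proc/${FSM_PID}/limits
ls /proc/${FSM_PID}/fd | wc -l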
FSM Configuration File Settings
The following FSM configuration file settings are explained in greater detail in the snfs.cfgx file and snfs_config(5) man pages, which are available in the StorNext Man Pages Reference Guide posted here (click the “Select a StorNext Version” menu to view the desired documents):
http://www.quantum.com/sn5docs
Please refer there for setting details and an example file. For a sample FSM configuration file, see Example FSM Configuration File.
Stripe Groups
Splitting apart data, metadata, and journal into separate stripe groups is usually the most important performance tactic. The create, remove, and allocate (e.g., write) operations are very sensitive to I/O latency of the journal stripe group. However, if create, remove, and allocate performance is not critical, it is acceptable to share a stripe group for both metadata and journal, but be sure to set the exclusive property on the stripe group so it does not get allocated for data as well.
Note: It is recommended that you have only a single metadata stripe group. For increased performance, use multiple LUNs (2 or 4) for the stripe group.
RAID 1 mirroring is optimal for metadata and journal storage. Utilizing the write-back caching feature of the RAID system (as described previously) is critical to optimizing performance of the journal and metadata stripe groups. Quantum recommends mapping no more than one LUN per RAID 1 set.
Example (Linux)
<stripeGroup index="0" name="MetaFiles" status="up" stripeBreadth="262144" read="true" write="true" metadata="true" journal="false" userdata="false" realTimeIOs="200" realTimeIOsReserve="1" realTimeMB="200" realTimeMBReserve="1" realTimeTokenTimeout="0" multipathMethod="rotate">
<disk index="0" diskLabel="CvfsDisk0" diskType="MetaDrive"/>
</stripeGroup>
<stripeGroup index="1" name="JournFiles" status="up" stripeBreadth="262144" read="true" write="true" metadata="false" journal="true" userdata="false" realTimeIOs="0" realTimeIOsReserve="0" realTimeMB="0" realTimeMBReserve="0" realTimeTokenTimeout="0" multipathMethod="rotate">
<disk index="0" diskLabel="CvfsDisk1" diskType="JournalDrive"/>
</stripeGroup>
<stripeGroup index="4" name="RegularFiles" status="up" stripeBreadth="262144" read="true" write="true" metadata="false" journal="false" userdata="true" realTimeIOs="0" realTimeIOsReserve="0" realTimeMB="0" realTimeMBReserve="0" realTimeTokenTimeout="0" multipathMethod="rotate">
<disk index="0" diskLabel="CvfsDisk14" diskType="DataDrive"/>
<disk index="1" diskLabel="CvfsDisk15" diskType="DataDrive"/>
<disk index="2" diskLabel="CvfsDisk16" diskType="DataDrive"/>
<disk index="3" diskLabel="CvfsDisk17" diskType="DataDrive"/>
</stripeGroup>
Example (Windows)
[StripeGroup MetaFiles]
Status Up
StripeBreadth 256K
Metadata Yes
Journal No
Exclusive Yes
Read Enabled
Write Enabled
Rtmb 200
Rtios 200
RtmbReserve 1
RtiosReserve 1
RtTokenTimeout 0
MultiPathMethod Rotate
Node CvfsDisk0 0
[StripeGroup JournFiles]
Status Up
StripeBreadth 256K
Metadata No
Journal Yes
Exclusive Yes
Read Enabled
Write Enabled
Rtmb 0
Rtios 0
RtmbReserve 0
RtiosReserve 0
RtTokenTimeout 0
MultiPathMethod Rotate
Node CvfsDisk1 0
[StripeGroup RegularFiles]
Status Up
StripeBreadth 256K
Metadata No
Journal No
Exclusive No
Read Enabled
Write Enabled
Rtmb 0
Rtios 0
RtmbReserve 0
RtiosReserve 0
RtTokenTimeout 0
MultiPathMethod Rotate
Node CvfsDisk14 0
Node CvfsDisk15 1
Node CvfsDisk16 2
Node CvfsDisk17 3
Affinities
Affinities are another stripe group feature that can be very beneficial. Affinities can direct file allocation to appropriate stripe groups according to performance requirements. For example, stripe groups can be set up with unique hardware characteristics such as fast disk versus slow disk, or wide stripe versus narrow stripe. Affinities can then be employed to steer files to the appropriate stripe group.
For optimal performance, files that are accessed using large DMA-based I/O could be steered to wide-stripe stripe groups. Less performance-critical files could be steered to slow disk stripe groups. Small files could be steered clear of large files, or to narrow-stripe stripe groups.
Example (Linux)
<stripeGroup index="3" name="AudioFiles" status="up" stripeBreadth="1048576" read="true" write="true" metadata="false" journal="false" userdata="true" realTimeIOs="0" realTimeIOsReserve="0" realTimeMB="0" realTimeMBReserve="0" realTimeTokenTimeout="0" multipathMethod="rotate">
<affinities exclusive="true">
<affinity>Audio</affinity>
</affinities>
<disk index="0" diskLabel="CvfsDisk10" diskType="AudioDrive"/>
<disk index="1" diskLabel="CvfsDisk11" diskType="AudioDrive"/>
<disk index="2" diskLabel="CvfsDisk12" diskType="AudioDrive"/>
<disk index="3" diskLabel="CvfsDisk13" diskType="AudioDrive"/>
</stripeGroup>
Example (Windows)
[StripeGroup AudioFiles]
Status Up
StripeBreadth 1M
Metadata No
Journal No
Exclusive Yes
Read Enabled
Write Enabled
Rtmb 0
Rtios 0
RtmbReserve 0
RtiosReserve 0
RtTokenTimeout 0
MultiPathMethod Rotate
Node CvfsDisk10 0
Node CvfsDisk11 1
Node CvfsDisk12 2
Node CvfsDisk13 3
Affinity Audio
Note: Affinity names cannot be longer than eight characters.
StripeBreadth
This setting should match the RAID stripe size or be a multiple of the RAID stripe size. Matching the RAID stripe size is usually the optimal setting. However, depending on the RAID performance characteristics and application I/O size, it might be beneficial to use a multiple or integer fraction of the RAID stripe size. For example, if the RAID stripe size is 256K, the stripe group contains 4 LUNs, and the application to be optimized uses DMA I/O with an 8MB block size, a StripeBreadth setting of 2MB might be optimal. In this example the 8MB application I/O is issued as four concurrent 2MB I/Os to the RAID. This concurrency can provide up to a 4X performance increase. Finding the best value typically requires some experimentation to determine the RAID characteristics; the lmdd utility can be very helpful. Note that this setting is not adjustable after initial file system creation.
The optimal range for the StripeBreadth setting is 128K to multiple megabytes, but this varies widely.
Note: This setting cannot be changed after the file system is put into production, so it is important to choose the setting carefully during initial configuration.
Example (Linux)
<stripeGroup index="2" name="VideoFiles" status="up" stripeBreadth="4194304" read="true" write="true" metadata="false" journal="false" userdata="true" realTimeIOs="0" realTimeIOsReserve="0" realTimeMB="0" realTimeMBReserve="0" realTimeTokenTimeout="0" multipathMethod="rotate">
<affinities exclusive="true">
<affinity>Video</affinity>
</affinities>
<disk index="0" diskLabel="CvfsDisk2" diskType="VideoDrive"/>
<disk index="1" diskLabel="CvfsDisk3" diskType="VideoDrive"/>
<disk index="2" diskLabel="CvfsDisk4" diskType="VideoDrive"/>
<disk index="3" diskLabel="CvfsDisk5" diskType="VideoDrive"/>
<disk index="4" diskLabel="CvfsDisk6" diskType="VideoDrive"/>
<disk index="5" diskLabel="CvfsDisk7" diskType="VideoDrive"/>
<disk index="6" diskLabel="CvfsDisk8" diskType="VideoDrive"/>
<disk index="7" diskLabel="CvfsDisk9" diskType="VideoDrive"/>
</stripeGroup>
Example (Windows)
[StripeGroup VideoFiles]
Status Up
StripeBreadth 4M
Metadata No
Journal No
Exclusive Yes
Read Enabled
Write Enabled
Rtmb 0
Rtios 0
RtmbReserve 0
RtiosReserve 0
RtTokenTimeout 0
MultiPathMethod Rotate
Node CvfsDisk2 0
Node CvfsDisk3 1
Node CvfsDisk4 2
Node CvfsDisk5 3
Node CvfsDisk6 4
Node CvfsDisk7 5
Node CvfsDisk8 6
Node CvfsDisk9 7
Affinity Video
BufferCacheSize
Increasing this value can reduce the latency of any metadata operation by performing a hot cache access to directory blocks, inode information, and other metadata info. This is about 10 to 1000 times faster than I/O. It is especially important to increase this setting if metadata I/O latency is high (for example, more than 2ms average latency). Quantum recommends sizing this according to how much memory is available; more is better. Optimal settings for BufferCacheSize range from 32MB to 8GB for a new file system and can be increased up to 500GB as a file system grows. A higher setting is more effective if the CPU is not heavily loaded.
When the value of BufferCacheSize is greater than 1GB, SNFS uses compressed buffers to maximize the amount of cached metadata. The effective size of the cache is as follows:
If BufferCacheSize is less than or equal to 1GB, then:
Effective Cache Size = BufferCacheSize
If BufferCacheSize is greater than 1GB, then:
Effective Cache Size = (BufferCacheSize - 512MB) * 2.5
The value 2.5 in the above formula represents a typical level of compression. This factor may be somewhat lower or higher, depending on the complexity of the file system metadata. For example, with BufferCacheSize set to 4GB, the effective cache size is approximately (4GB - 512MB) * 2.5 = 8.75GB.
Note: Configuring a large value of BufferCacheSize will increase the memory footprint of the FSM process. If this process crashes, a core file will be generated that will consume disk space proportional to its size.
Example (Linux)
<bufferCacheSize>268435456</bufferCacheSize>
Example (Windows)
BufferCacheSize 256MB
In StorNext 5, the default value for the BufferCacheSize parameter in the file system configuration file changed from 32 MB to 256 MB.
Although this is uncommon, if a file system configuration file is missing this parameter, the new default value will be in effect. This may improve performance; however, the FSM process will use more memory than it did with previous releases (up to 400 MB).
To avoid the increased memory utilization, the BufferCacheSize parameter may be added to the file system configuration file with the old default value.
Example (Linux)
<bufferCacheSize>33554432</bufferCacheSize>
Example (Windows)
BufferCacheSize 32M
InodeCacheSize
This setting consumes about 1400 bytes of memory times the number specified. Increasing this value can reduce latency of any metadata operation by performing a hot cache access to inode information instead of an I/O to get inode info from disk, about 100 to 1000 times faster. It is especially important to increase this setting if metadata I/O latency is high, (for example, more than 2ms average latency). You should try to size this according to the sum number of working set files for all clients. Optimal settings for InodeCacheSize
range from 16K to 128K for a new file system and can be increased to 256K or 512K as a file system grows. A higher setting is more effective if the CPU is not heavily loaded. For best performance, the InodeCacheSize
should be at least 1024 times the number of megabytes allocated to the journal. For example, for a 64MB journal, the inodeCacheSize
should be at least 64K.
Example (Linux)
<inodeCacheSize>131072</inodeCacheSize>
Example (Windows)
InodeCacheSize 128K
In StorNext 5, the default value for the InodeCacheSize parameter in the file system configuration file changed from 32768 to 131072.
Although this is uncommon, if a file system configuration file is missing this parameter, the new default value will be in effect. This may improve performance; however, the FSM process will use more memory than it did with previous releases (up to 400 MB).
To avoid the increased memory utilization, the InodeCacheSize parameter may be added to the file system configuration file with the old default value.
Example (Linux)
<inodeCacheSize>32768</inodeCacheSize>
Example (Windows)
InodeCacheSize 32K
FsBlockSize
Beginning with StorNext 5, all SNFS file systems use a File System Block Size of 4KB. This is the optimal value and is no longer tunable. Any file systems created with pre-5 versions of StorNext having larger File System Block Sizes will be automatically converted to use 4KB the first time the file system is started with StorNext 5.
JournalSize
Beginning with StorNext 5, the recommended setting for JournalSize is 64Mbytes.
Increasing the JournalSize beyond 64Mbytes may be beneficial for workloads where many large directories are being created or removed at the same time. For example, workloads dealing with 100,000 files in a directory and several directories at once will see improved throughput with a larger journal.
The downside of a larger journal size is potentially longer FSM startup and failover times.
Using a value less than 64Mbytes may improve failover time but reduce file system performance. Values less than 16Mbytes are not recommended.
Note: Journal replay has been optimized with StorNext 5, so a 64Mbyte journal will often replay significantly faster with StorNext 5 than a 16Mbyte journal did with prior releases.
A file system created with a pre-5 version of StorNext may have been configured with a small JournalSize. This is true for file systems created on Windows MDCs, where the old default size of the journal was 4Mbytes. Journals of this size will continue to function with StorNext 5, but the file system will see a performance benefit if the size is increased to 64Mbytes. This can be adjusted using the cvupdatefs utility. For more information, see the cvupdatefs command in the StorNext Man Pages Reference Guide.
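The following sketch outlines one way to grow the journal on a Linux MDC. The file system name, configuration path, and exact procedure are illustrative; confirm the steps in the cvupdatefs man page before use.
# Stop the file system, set a 64MB journal in its configuration file,
# then apply the change with cvupdatefs and restart
cvadmin -e "stop snfs1"
# edit /usr/cvfs/config/snfs1.cfgx and set journalSize="67108864"
cvupdatefs snfs1
cvadmin -e "start snfs1"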
If a file system was previously configured with a JournalSize larger than 64Mbytes, there is no reason to reduce it when upgrading to StorNext 5.
Example (Linux)
<config configVersion="0" name="example" fsBlockSize="4096" journalSize="67108864">
Example (Windows)
JournalSize 64M
SNFS Tools
The snfsdefrag tool is very useful to identify and correct file extent fragmentation. Reducing extent fragmentation can be very beneficial for performance. You can use this utility to determine whether files are fragmented, and if so, fix them.
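For example, the following invocations report a file's extent list and then defragment a directory tree. This is a sketch; the paths are illustrative, and the options should be verified against the snfsdefrag man page.
# Report the extent list for a file to check for fragmentation
snfsdefrag -e /stornext/snfs1/video/clip001.dpx
# Recursively defragment all fragmented files under a directory
snfsdefrag -r /stornext/snfs1/video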
Qustats
The qustats measure overall metadata statistics, physical I/O, VOP statistics, and client-specific VOP statistics.
The overall metadata statistics include journal and cache information. All of these can be affected by changing the configuration parameters for a file system, for example by increasing the journal size or the cache sizes.
The physical I/O statistics show the number and speed of disk I/Os. Poor numbers can indicate hardware problems or over-subscribed disks.
The VOP statistics show what requests SNFS clients are making to the MDCs, which can indicate where workflow changes may improve performance.
The client-specific VOP statistics show which clients are generating the VOP requests.
Examples of qustat operations:
Print the current stats to stdout.
Print a description of a particular stat.
Note: Use * for the stat name to print descriptions of all stats.
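For illustration only, the invocations below assume that -g selects the file system and -D requests a stat description; the file system and stat names are placeholders, and the actual options should be confirmed in the qustat man page.
# Print the current stats for a file system to stdout
qustat -g snfs1
# Print a description of a particular stat (use * to describe all stats)
qustat -g snfs1 -D "VOP Lookup"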
For additional information on qustat, see the qustat man page.
Table 1: Qustat Counters
There are a large number of qustat counters available in the output file; most are debugging counters and are not useful in measuring or tuning the file system. The items in the table below have been identified as the most interesting counters. All other counters can be ignored.
Name | Type | Description
 |  | If hit rate is low, FSM …
 |  | If hit rate is low, …
Physical metadata |  |
 |  | The maximum time to service a metadata read request.
 |  | The maximum time to complete a read system call.
 |  | The average time to service a metadata read request.
File and Directory operations |  |
 |  | File create and remove operations.
 | Cnt | The count of operations in an hour.
 |  | Directory create and remove operations.
 | Cnt | The count of operations in an hour.
 |  | File and directory rename/mv operations.
 | Cnt | The count of operations in an hour.
 |  | File open and close operations.
 | Cnt | The count of operations in an hour.
 |  | The application …
 | Cnt | The count of operations in an hour.
 |  | Gather attributes for a directory listing (Windows only).
 | Cnt | The count of operations in an hour.
Get Attr and Set Attr |  | Attribute updates and queries; touch, stat, and implicit stat calls.
 | Cnt | The count of operations in an hour.
 | n/a | Per-client VOP stats are available to determine which client may be putting a load on the MDC.
The qustat command also supports the client module. The client is the StorNext file system driver that runs in the Linux kernel. Unlike the cvdb PERF traces, which track individual I/O operations, the qustat statistics group like operations into buckets and track minimum, maximum, and average duration times. In the following output, one can see that the statistics show global counters for all file systems as well as counters for individual file systems. In this example, only a single file system is shown.
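For example, the client-module statistics can be dumped with an invocation along the following lines; the -m module option is assumed here, so check the qustat man page for the exact syntax.
# Print statistics gathered by the StorNext client kernel module
qustat -m client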
The VFSOPS and VNOPS tables keep track of metadata operations. These statistics are also available from the FSM.
The remaining tables show read and write performance. Table 4, ending in .san, displays reads and writes to attached storage.
Table 5, ending in .gw, displays I/O requests that were performed as a result of acting as a gateway for a StorNext distributed LAN client.
Tables 6 through 9 show the same statistics, san and gateway, but broken out by stripe group. The stripe group names are AudioFiles and RegularFiles.
You can use these statistics to determine the relative performance of different stripe groups, and you can use affinities to direct I/O to particular stripe groups.
This is the end of example 1.
The next example is from a distributed LAN client. It has the same Global sections. It has no entries in the .san section because it is not a SAN client, and it has no .gw entries because it is not a gateway. Its read/write statistics appear in the .lan sections.
SNFS supports the Windows Perfmon utility (see Windows Performance Monitor Counters). This provides many useful statistics counters for the SNFS client component. Run rmperfreg.exe and instperfreg.exe to set up the required registry settings. Next, call cvdb -P. After these steps, the SNFS counters should be visible to the Windows Perfmon utility. If not, check the Windows Application Event log for errors.
The cvcp utility is a higher-performance alternative to commands such as cp and tar. The cvcp utility achieves high performance by using threads, large I/O buffers, preallocation, stripe alignment, DMA I/O transfer, and Bulk Create. Also, the cvcp utility uses the SNFS External API for preallocation and stripe alignment. In the directory-to-directory copy mode (for example, cvcp source_dir destination_dir), cvcp conditionally uses the Bulk Create API to provide a dramatic small-file copy performance boost. However, it will not use Bulk Create in some scenarios, such as non-root invocation, managed file systems, quotas, or Windows security. When Bulk Create is utilized, it significantly boosts performance by reducing the number of metadata operations issued. For example, up to 20 files can be created with a single metadata operation. For more information, see the cvcp man page.
The cvmkfile utility provides a command line tool to utilize valuable SNFS performance features. These features include preallocation, stripe alignment, and affinities. See the cvmkfile man page.
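A minimal sketch of such an invocation is shown below; the affinity key, size syntax, and path are illustrative, so check the cvmkfile man page for the exact option set.
# Preallocate a 10 GB file and steer it to the stripe group with the Video affinity
cvmkfile -k Video 10g /stornext/snfs1/video/capture001.dpx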
The lmdd utility is very useful for measuring raw LUN performance as well as the effect of varied I/O transfer sizes. It is part of the lmbench package and is available from http://sourceforge.net.
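For example, raw read throughput of a LUN can be sampled at candidate transfer sizes along these lines. The device name and counts are illustrative, and read-only tests are used because writing to a raw LUN destroys data.
# Read 1000 x 2MB transfers from the LUN, discarding the data (of=internal)
lmdd if=/dev/sdb of=internal bs=2m count=1000
# Repeat with other transfer sizes (for example, 512k, 1m, 4m) to compare
lmdd if=/dev/sdb of=internal bs=4m count=500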
The cvdbset utility has a special “Perf” trace flag that is very useful to analyze I/O performance. For example: cvdbset perf
Then, you can use cvdb -g to collect trace information such as this:
PERF: Device Write 41 MB/s IOs 2 exts 1 offs 0x0 len 0x400000 mics 95589 ino 0x5
PERF: VFS Write EofDmaAlgn 41 MB/s offs 0x0 len 0x400000 mics 95618 ino 0x5
The “PERF: Device” trace displays throughput measured for the device I/O. It also displays the number of I/Os into which it was broken, and the number of extents (sequence of consecutive filesystem blocks).
The “PERF: VFS” trace displays throughput measured for the read or write system call and significant aspects of the I/O, including:
Dma: DMA
Buf: Buffered
Eof: File extended
Algn: Well-formed DMA I/O
Shr: File is shared by another client
Rt: File is real time
Zr: Hole in file was zeroed
Both traces also report file offset, I/O size, latency (mics), and inode number.
Sample use cases:
Verify that I/O properties are as expected.
You can use the VFS trace to ensure that the displayed properties are consistent with expectations, such as being well formed; buffered versus DMA; shared/non-shared; or I/O size. If a small I/O is performed using DMA, performance will be poor. If DMA I/O is not well formed, it requires an extra data copy and may even be broken into small chunks. Zeroing holes in files has a performance impact.
Determine if metadata operations are impacting performance.
If VFS throughput is inconsistent or significantly less than Device throughput, it might be caused by metadata operations. In that case, it would be useful to display “fsmtoken,” “fsmvnops,” and “fsmdmig” traces in addition to “perf.”
Identify disk performance issues.
If Device throughput is inconsistent or less than expected, it might indicate a slow disk in a stripe group, or that RAID tuning is necessary.
Identify file fragmentation.
If the extent count “exts” is high, it might indicate a fragmentation problem. This causes the device I/Os to be broken into smaller chunks, which can significantly impact throughput.
Identify read/modify/write condition.
If buffered VFS writes are causing Device reads, it might be beneficial to match I/O request size to a multiple of the “cachebufsize” (default 64KB; see mount_cvfs man page). Another way to avoid this is by truncating the file before writing.
The cvadmin command includes a latency-test utility for measuring the latency between an FSM and one or more SNFS clients. This utility causes small messages to be exchanged between the FSM and clients as quickly as possible for a brief period of time, and reports the average time it took for each message to receive a response.
The latency-test command has the following syntax:
latency-test <index-number> [ <seconds> ]
latency-test all [ <seconds> ]
If an index-number is specified, the test is run between the currently-selected FSM and the specified client. (Client index numbers are displayed by the cvadmin who command). If all is specified, the test is run against each client in turn.
The test is run for 2 seconds, unless a value for seconds is specified.
Here is a sample run:
snadmin (lsi) > latency-test
Test started on client 1 (bigsky-node2)... latency 55us
Test started on client 2 (k4)... latency 163us
There is no rule-of-thumb for “good” or “bad” latency values. The observed latency for GbE is less than 60 microseconds. Latency can be affected by CPU load or SNFS load on either system, by unrelated Ethernet traffic, or other factors. However, for otherwise idle systems, differences in latency between different systems can indicate differences in hardware performance. (In the example above, the difference is a Gigabit Ethernet and faster CPU versus a 100BaseT Ethernet and a slower CPU.) Differences in latency over time for the same system can indicate new hardware problems, such as a network interface going bad.
If a latency test has been run for a particular client, the cvadmin who long command includes the test results in its output, along with information about when the test was last run.
Mount Command Options
The following SNFS mount command settings are explained in greater detail in the mount_cvfs man page.
The default size of the client buffer cache varies by platform and main memory size, and ranges between 32MB and 256MB. By default, each buffer is 64K, so the cache contains between 512 and 4096 buffers. In general, increasing the size of the buffer cache will not improve performance for streaming reads and writes. However, a large cache helps greatly in cases of multiple concurrent streams, and where files are being written and subsequently read. Buffer cache size is adjusted with the buffercachecap setting.
The buffer cache I/O size is adjusted using the cachebufsize setting. The default setting is usually optimal; however, sometimes performance can be improved by increasing this setting to match the RAID 5 stripe size.
Note: In prior releases of StorNext, using a large cachebufsize setting could decrease small, random I/O READ performance. However, in StorNext 5, the buffer cache has been modified to avoid this issue.
The cachebufsize parameter is a mount option and can be unique for every client that mounts the file system.
Buffer cache read-ahead can be adjusted with the buffercache_readahead setting. When the system detects that a file is being read in its entirety, several buffer cache I/O daemons pre-fetch data from the file in the background for improved performance. The default setting is optimal in most scenarios.
The auto_dma_read_length and auto_dma_write_length settings determine the minimum transfer size where direct DMA I/O is performed instead of using the buffer cache for well-formed I/O. These settings can be useful when performance degradation is observed for small DMA I/O sizes compared to buffer cache.
For example, if buffer cache I/O throughput is 200 MB/sec but 512K DMA I/O size observes only 100MB/sec, it would be useful to determine which DMA I/O size matches the buffer cache performance and adjust auto_dma_read_length and auto_dma_write_length accordingly. The lmdd utility is handy here.
The dircachesize option sets the size of the directory information cache on the client. This cache can dramatically improve the speed of readdir operations by reducing metadata network message traffic between the SNFS client and FSM. Increasing this value improves performance in scenarios where very large directories are not observing the benefit of the client directory cache.
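As an illustration, several of these options can be combined in a single /etc/fstab entry on a Linux client; the values and units shown are placeholders, so confirm them in the mount_cvfs man page.
# /etc/fstab entry for a StorNext client (values are illustrative)
snfs1  /stornext/snfs1  cvfs  rw,buffercachecap=512,cachebufsize=262144,auto_dma_read_length=1048576,auto_dma_write_length=1048576,dircachesize=33554432  0  0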
SNFS External API
The SNFS External API might be useful in some scenarios because it offers programmatic use of special SNFS performance capabilities such as affinities, preallocation, and quality of service. For more information, see the “Quality of Service” topic of the StorNext File System API Guide posted here (click the “Select a StorNext Version” menu to view the desired documents):