The Metadata Controller System
The CPU power and memory capacity of the MDC System are important performance factors, as is the number of file systems hosted per system. To ensure fast response time, use dedicated systems, limit the number of file systems hosted per system (maximum of eight), and provide adequate CPU power and memory. Refer to the StorNext 5 User’s Guide for limits on the number of files per file system and per database instance.
Some metadata operations such as file creation can be CPU intensive, and benefit from increased CPU power.
Other operations, such as directory traversal, can benefit greatly from increased memory. SNFS provides two configuration file settings that can be used to realize performance gains from increased memory: BufferCacheSize and InodeCacheSize.
However, it is critical that the MDC system have enough physical memory available to ensure that the FSM process doesn’t get swapped out. Otherwise, severe performance degradation and system instability can result.
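On a Linux MDC, a quick spot check can confirm that the FSM has enough physical memory and is not being swapped. The following is a minimal sketch using standard Linux tools; the process name fsm and the sampling interval are illustrative.
# Show resident (RSS) and virtual memory for the FSM process(es)
ps -eo pid,rss,vsz,comm | grep '[f]sm'
# Watch swap activity; the si/so columns should remain at or near zero
vmstat 5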
The operating system on the metadata controller must always be run in U.S. English. On Windows systems, this is done by setting the system locale to U.S. English.
Caution: Because the File System Manager (FSM) supports over 1000 clients (with more than 1000 file requests per client), the resource limits of your MDC may be exhausted with additional load from other processes. Exceeding the file descriptor limit will cause errors on your system. Quantum recommends that you not run additional applications on the MDC.
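As a minimal sketch for a Linux MDC, the commands below check the open-file limit and the current descriptor usage of the FSM process; the pgrep pattern is illustrative.
# Inspect the open-file limit and current descriptor usage of one FSM process
FSM_PID=$(pgrep -o -x fsm)
grep 'Max open files' /proc/${FSM_PID}/limits
ls /proc/${FSM_PID}/fd | wc -l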
FSM Configuration File Settings
The following FSM configuration file settings are explained in greater detail in the snfs.cfgx file and snfs_config(5) man pages, which are available in the StorNext Man Pages Reference Guide posted here (click the “Select a StorNext Version” menu to view the desired documents):
http://www.quantum.com/sn5docs
Please refer there for setting details and an example file. For a sample FSM configuration file, see Example FSM Configuration File.
Stripe Groups
Splitting apart data, metadata, and journal into separate stripe groups is usually the most important performance tactic. The create, remove, and allocate (e.g., write) operations are very sensitive to I/O latency of the journal stripe group. However, if create, remove, and allocate performance is not critical, it is acceptable to share a stripe group for both metadata and journal, but be sure to set the exclusive property on the stripe group so it does not get allocated for data as well.
Note: It is recommended that you have only a single metadata stripe group. For increased performance, use multiple LUNs (2 or 4) for the stripe group.
RAID 1 mirroring is optimal for metadata and journal storage. Utilizing the write-back caching feature of the RAID system (as described previously) is critical to optimizing performance of the journal and metadata stripe groups. Quantum recommends mapping no more than one LUN per RAID 1 set.
Example (Linux)
<stripeGroup index="0" name="MetaFiles" status="up" stripeBreadth="262144" read="true" write="true" metadata="true" journal="false" userdata="false" realTimeIOs="200" realTimeIOsReserve="1" realTimeMB="200" realTimeMBReserve="1" realTimeTokenTimeout="0" multipathMethod="rotate">
<disk index="0" diskLabel="CvfsDisk0" diskType="MetaDrive"/>
</stripeGroup>
<stripeGroup index="1" name="JournFiles" status="up" stripeBreadth="262144" read="true" write="true" metadata="false" journal="true" userdata="false" realTimeIOs="0" realTimeIOsReserve="0" realTimeMB="0" realTimeMBReserve="0" realTimeTokenTimeout="0" multipathMethod="rotate">
<disk index="0" diskLabel="CvfsDisk1" diskType="JournalDrive"/>
</stripeGroup>
<stripeGroup index="4" name="RegularFiles" status="up" stripeBreadth="262144" read="true" write="true" metadata="false" journal="false" userdata="true" realTimeIOs="0" realTimeIOsReserve="0" realTimeMB="0" realTimeMBReserve="0" realTimeTokenTimeout="0" multipathMethod="rotate">
<disk index="0" diskLabel="CvfsDisk14" diskType="DataDrive"/>
<disk index="1" diskLabel="CvfsDisk15" diskType="DataDrive"/>
<disk index="2" diskLabel="CvfsDisk16" diskType="DataDrive"/>
<disk index="3" diskLabel="CvfsDisk17" diskType="DataDrive"/>
</stripeGroup>
Example (Windows)
[StripeGroup MetaFiles]
Status Up
StripeBreadth 256K
Metadata Yes
Journal No
Exclusive Yes
Read Enabled
Write Enabled
Rtmb 200
Rtios 200
RtmbReserve 1
RtiosReserve 1
RtTokenTimeout 0
MultiPathMethod Rotate
Node CvfsDisk0 0
[StripeGroup JournFiles]
Status Up
StripeBreadth 256K
Metadata No
Journal Yes
Exclusive Yes
Read Enabled
Write Enabled
Rtmb 0
Rtios 0
RtmbReserve 0
RtiosReserve 0
RtTokenTimeout 0
MultiPathMethod Rotate
Node CvfsDisk1 0
[StripeGroup RegularFiles]
Status Up
StripeBreadth 256K
Metadata No
Journal No
Exclusive No
Read Enabled
Write Enabled
Rtmb 0
Rtios 0
RtmbReserve 0
RtiosReserve 0
RtTokenTimeout 0
MultiPathMethod Rotate
Node CvfsDisk14 0
Node CvfsDisk15 1
Node CvfsDisk16 2
Node CvfsDisk17 3
Affinities
Affinities are another stripe group feature that can be very beneficial. Affinities can direct file allocation to appropriate stripe groups according to performance requirements. For example, stripe groups can be set up with unique hardware characteristics such as fast disk versus slow disk, or wide stripe versus narrow stripe. Affinities can then be employed to steer files to the appropriate stripe group.
For optimal performance, files that are accessed using large DMA-based I/O could be steered to wide-stripe stripe groups. Less performance-critical files could be steered to slow disk stripe groups. Small files could be steered clear of large files, or to narrow-stripe stripe groups.
Example (Linux)
<stripeGroup index="3" name="AudioFiles" status="up" stripeBreadth="1048576" read="true" write="true" metadata="false" journal="false" userdata="true" realTimeIOs="0" realTimeIOsReserve="0" realTimeMB="0" realTimeMBReserve="0" realTimeTokenTimeout="0" multipathMethod="rotate">
<affinities exclusive="true">
<affinity>Audio</affinity>
</affinities>
<disk index="0" diskLabel="CvfsDisk10" diskType="AudioDrive"/>
<disk index="1" diskLabel="CvfsDisk11" diskType="AudioDrive"/>
<disk index="2" diskLabel="CvfsDisk12" diskType="AudioDrive"/>
<disk index="3" diskLabel="CvfsDisk13" diskType="AudioDrive"/>
</stripeGroup>
Example (Windows)
[StripeGroup AudioFiles]
Status Up
StripeBreadth 1M
Metadata No
Journal No
Exclusive Yes
Read Enabled
Write Enabled
Rtmb 0
Rtios 0
RtmbReserve 0
RtiosReserve 0
RtTokenTimeout 0
MultiPathMethod Rotate
Node CvfsDisk10 0
Node CvfsDisk11 1
Node CvfsDisk12 2
Node CvfsDisk13 3
Affinity Audio
Note: Affinity names cannot be longer than eight characters.
StripeBreadth
This setting should match the RAID stripe size or be a multiple of the RAID stripe size. Matching the RAID stripe size is usually the optimal setting. However, depending on the RAID performance characteristics and application I/O size, it might be beneficial to use a multiple or integer fraction of the RAID stripe size. For example, if the RAID stripe size is 256K, the stripe group contains 4 LUNs, and the application to be optimized uses DMA I/O with an 8MB block size, a StripeBreadth setting of 2MB might be optimal. In this example the 8MB application I/O is issued as four concurrent 2MB I/Os to the RAID. This concurrency can provide up to a 4X performance increase. Finding the best value typically requires some experimentation to determine the RAID characteristics; the lmdd utility can be very helpful. Note that this setting is not adjustable after initial file system creation.
The optimal range for the StripeBreadth setting is 128K to multiple megabytes, but this varies widely.
Note: This setting cannot be changed after the file system is put into production, so it is important to choose the setting carefully during initial configuration.
Example (Linux)
<stripeGroup index="2" name="VideoFiles" status="up" stripeBreadth="4194304" read="true" write="true" metadata="false" journal="false" userdata="true" realTimeIOs="0" realTimeIOsReserve="0" realTimeMB="0" realTimeMBReserve="0" realTimeTokenTimeout="0" multipathMethod="rotate">
<affinities exclusive="true">
<affinity>Video</affinity>
</affinities>
<disk index="0" diskLabel="CvfsDisk2" diskType="VideoDrive"/>
<disk index="1" diskLabel="CvfsDisk3" diskType="VideoDrive"/>
<disk index="2" diskLabel="CvfsDisk4" diskType="VideoDrive"/>
<disk index="3" diskLabel="CvfsDisk5" diskType="VideoDrive"/>
<disk index="4" diskLabel="CvfsDisk6" diskType="VideoDrive"/>
<disk index="5" diskLabel="CvfsDisk7" diskType="VideoDrive"/>
<disk index="6" diskLabel="CvfsDisk8" diskType="VideoDrive"/>
<disk index="7" diskLabel="CvfsDisk9" diskType="VideoDrive"/>
</stripeGroup>
Example (Windows)
[StripeGroup VideoFiles]
Status Up
StripeBreadth 4M
Metadata No
Journal No
Exclusive Yes
Read Enabled
Write Enabled
Rtmb 0
Rtios 0
RtmbReserve 0
RtiosReserve 0
RtTokenTimeout 0
MultiPathMethod Rotate
Node CvfsDisk2 0
Node CvfsDisk3 1
Node CvfsDisk4 2
Node CvfsDisk5 3
Node CvfsDisk6 4
Node CvfsDisk7 5
Node CvfsDisk8 6
Node CvfsDisk9 7
Affinity Video
BufferCacheSize
Increasing this value can reduce the latency of any metadata operation by performing a hot cache access to directory blocks, inode information, and other metadata info. This is about 10 to 1000 times faster than I/O. It is especially important to increase this setting if metadata I/O latency is high (for example, more than 2ms average latency). Quantum recommends sizing this according to how much memory is available; more is better. Optimal settings for BufferCacheSize range from 32MB to 8GB for a new file system and can be increased up to 500GB as a file system grows. A higher setting is more effective if the CPU is not heavily loaded.
When the value of BufferCacheSize is greater than 1GB, SNFS uses compressed buffers to maximize the amount of cached metadata. The effective size of the cache is as follows:
If BufferCacheSize is less than or equal to 1GB, then:
Effective Cache Size = BufferCacheSize
If BufferCacheSize is greater than 1GB, then:
Effective Cache Size = (BufferCacheSize - 512MB) * 2.5
The value 2.5 in the above formula represents a typical level of compression. This factor may be somewhat lower or higher, depending on the complexity of the file system metadata. For example, with BufferCacheSize set to 4GB, the effective cache size is approximately (4GB - 512MB) * 2.5 = 8.75GB.
Note: Configuring a large value of BufferCacheSize will increase the memory footprint of the FSM process. If this process crashes, a core file will be generated that will consume disk space proportional to its size.
Example (Linux)
<bufferCacheSize>268435456</bufferCacheSize>
Example (Windows)
BufferCacheSize 256MB
In StorNext 5, the default value for the BufferCacheSize parameter in the file system configuration file changed from 32 MB to 256 MB.
Although this is uncommon, if a file system configuration file is missing this parameter, the new default value will be in effect. This may improve performance; however, the FSM process will use more memory than it did with previous releases (up to 400 MB).
To avoid the increased memory utilization, the BufferCacheSize parameter may be added to the file system configuration file with the old default value.
Example (Linux)
<bufferCacheSize>33554432</bufferCacheSize>
Example (Windows)
BufferCacheSize 32M
InodeCacheSize
This setting consumes about 1400 bytes of memory times the number specified. Increasing this value can reduce latency of any metadata operation by performing a hot cache access to inode information instead of an I/O to get inode info from disk, about 100 to 1000 times faster. It is especially important to increase this setting if metadata I/O latency is high, (for example, more than 2ms average latency). You should try to size this according to the sum number of working set files for all clients. Optimal settings for InodeCacheSize
range from 16K to 128K for a new file system and can be increased to 256K or 512K as a file system grows. A higher setting is more effective if the CPU is not heavily loaded. For best performance, the InodeCacheSize
should be at least 1024 times the number of megabytes allocated to the journal. For example, for a 64MB journal, the inodeCacheSize
should be at least 64K.
Example (Linux)
<inodeCacheSize>131072</inodeCacheSize>
Example (Windows)
InodeCacheSize 128K
In StorNext 5, the default value for the InodeCacheSize parameter in the file system configuration file changed from 32768 to 131072.
Although this is uncommon, if a file system configuration file is missing this parameter, the new default value will be in effect. This may improve performance; however, the FSM process will use more memory than it did with previous releases (up to 400 MB).
To avoid the increased memory utilization, the InodeCacheSize parameter may be added to the file system configuration file with the old default value.
Example (Linux)
<inodeCacheSize>32768</inodeCacheSize>
Example (Windows)
InodeCacheSize 32K
FsBlockSize
Beginning with StorNext 5, all SNFS file systems use a File System Block Size of 4KB. This is the optimal value and is no longer tunable. Any file systems created with pre-5 versions of StorNext having larger File System Block Sizes will be automatically converted to use 4KB the first time the file system is started with StorNext 5.
JournalSize
Beginning with StorNext 5, the recommended setting for JournalSize is 64Mbytes.
Increasing the JournalSize beyond 64Mbytes may be beneficial for workloads where many large directories are being created or removed at the same time. For example, workloads dealing with 100,000 files in a directory and several directories at once will see improved throughput with a larger journal.
The downside of a larger journal size is potentially longer FSM startup and failover times.
Using a value less than 64Mbytes may improve failover time but reduce file system performance. Values less than 16Mbytes are not recommended.
Note: Journal replay has been optimized with StorNext 5, so a 64Mbyte journal will often replay significantly faster with StorNext 5 than a 16Mbyte journal did with prior releases.
A file system created with a pre-5 version of StorNext may have been configured with a small JournalSize. This is true for file systems created on Windows MDCs, where the old default size of the journal was 4Mbytes. Journals of this size will continue to function with StorNext 5, but the file system will see a performance benefit if the size is increased to 64Mbytes. This can be adjusted using the cvupdatefs utility. For more information, see the cvupdatefs command in the StorNext Man Pages Reference Guide.
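The following sketch outlines one way to grow the journal on a Linux MDC. The file system name, configuration path, and exact procedure are illustrative; confirm the steps in the cvupdatefs man page before use.
# Stop the file system, set a 64MB journal in its configuration file,
# then apply the change with cvupdatefs and restart
cvadmin -e "stop snfs1"
# edit /usr/cvfs/config/snfs1.cfgx and set journalSize="67108864"
cvupdatefs snfs1
cvadmin -e "start snfs1"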
If a file system was previously configured with a JournalSize larger than 64Mbytes, there is no reason to reduce it when upgrading to StorNext 5.
Example (Linux)
<config configVersion="0" name="example" fsBlockSize="4096" journalSize="67108864">
Example (Windows)
JournalSize 64M
SNFS Tools
The snfsdefrag tool is very useful to identify and correct file extent fragmentation. Reducing extent fragmentation can be very beneficial for performance. You can use this utility to determine whether files are fragmented, and if so, fix them.
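For example, the following invocations report a file's extent list and then defragment a directory tree. This is a sketch; the paths are illustrative, and the options should be verified against the snfsdefrag man page.
# Report the extent list for a file to check for fragmentation
snfsdefrag -e /stornext/snfs1/video/clip001.dpx
# Recursively defragment all fragmented files under a directory
snfsdefrag -r /stornext/snfs1/video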
Qustats
The qustats measure overall metadata statistics, physical I/O, VOP statistics, and client-specific VOP statistics.
The overall metadata statistics include journal and cache information. All of these can be affected by changing the configuration parameters for a file system, for example by increasing the journal size or the cache sizes.
The physical I/O statistics show the number and speed of disk I/Os. Poor numbers can indicate hardware problems or over-subscribed disks.
The VOP statistics show what requests SNFS clients are making to the MDCs, which can indicate where workflow changes may improve performance.
The client-specific VOP statistics show which clients are generating the VOP requests.
Examples of qustat operations:
Print the current stats to stdout.
Print a description of a particular stat.
Note: Use * for the stat name to print descriptions of all stats.
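For illustration only, the invocations below assume that -g selects the file system and -D requests a stat description; the file system and stat names are placeholders, and the actual options should be confirmed in the qustat man page.
# Print the current stats for a file system to stdout
qustat -g snfs1
# Print a description of a particular stat (use * to describe all stats)
qustat -g snfs1 -D "VOP Lookup"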
For additional information on qustat, see the qustat man page.
Table 1: Qustat Counters
There are a large number of qustat counters available in the output file; most are debugging counters and are not useful in measuring or tuning the file system. The items in the table below have been identified as the most interesting counters. All other counters can be ignored.
Name | Type | Description
 |  | If hit rate is low, FSM …
 |  | If hit rate is low, …
Physical metadata |  |
 |  | The maximum time to service a metadata read request.
 |  | The maximum time to complete a read system call.
 |  | The average time to service a metadata read request.
File and Directory operations |  |
 |  | File create and remove operations.
 | Cnt | The count of operations in an hour.
 |  | Directory create and remove operations.
 | Cnt | The count of operations in an hour.
 |  | File and directory rename/mv operations.
 | Cnt | The count of operations in an hour.
 |  | File open and close operations.
 | Cnt | The count of operations in an hour.
 |  | The application …
 | Cnt | The count of operations in an hour.
 |  | Gather attributes for a directory listing (Windows only).
 | Cnt | The count of operations in an hour.
Get Attr and Set Attr |  | Attribute updates and queries; touch, stat, and implicit stat calls.
 | Cnt | The count of operations in an hour.
 | n/a | Per-client VOP stats are available to determine which client may be putting a load on the MDC.
The qustat command also supports the client module. The client is the StorNext file system driver that runs in the Linux kernel. Unlike the cvdb PERF traces, which track individual I/O operations, the qustat statistics group like operations into buckets and track minimum, maximum, and average duration times. In the following output, one can see that the statistics show global counters for all file systems as well as counters for individual file systems. In this example, only a single file system is shown.
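For example, the client-module statistics can be dumped with an invocation along the following lines; the -m module option is assumed here, so check the qustat man page for the exact syntax.
# Print statistics gathered by the StorNext client kernel module
qustat -m client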
The VFSOPS and VNOPS tables keep track of metadata operations. These statistics are also available from the FSM.
The remaining tables show read and write performance. Table 4, ending in .san, displays reads and writes to attached storage.
Table 5, ending in .gw, displays I/O requests that were performed as a result of acting as a gateway for a StorNext distributed LAN client.
Tables 6 through 9 show the same statistics, san and gateway, but broken out by stripe group. The stripe group names are AudioFiles and RegularFiles.
You can use these statistics to determine the relative performance of different stripe groups, and you can use affinities to direct I/O to particular stripe groups.
This is the end of example 1.
The next example is from a distributed LAN client. It has the same Global sections. It has no entries in the .san section because it is not a SAN client, and it has no .gw entries because it is not a gateway. Its read/write statistics appear in the .lan sections.
SNFS supports the Windows Perfmon utility (see Windows Performance Monitor Counters). This provides many useful statistics counters for the SNFS client component. Run rmperfreg.exe and instperfreg.exe to set up the required registry settings. Next, call cvdb -P. After these steps, the SNFS counters should be visible to the Windows Perfmon utility. If not, check the Windows Application Event log for errors.
The cvcp utility is a higher-performance alternative to commands such as cp and tar. The cvcp utility achieves high performance by using threads, large I/O buffers, preallocation, stripe alignment, DMA I/O transfer, and Bulk Create. Also, the cvcp utility uses the SNFS External API for preallocation and stripe alignment. In the directory-to-directory copy mode (for example, cvcp source_dir destination_dir), cvcp conditionally uses the Bulk Create API to provide a dramatic small-file copy performance boost. However, it will not use Bulk Create in some scenarios, such as non-root invocation, managed file systems, quotas, or Windows security. When Bulk Create is utilized, it significantly boosts performance by reducing the number of metadata operations issued. For example, up to 20 files can be created with a single metadata operation. For more information, see the cvcp man page.
The cvmkfile utility provides a command line tool to utilize valuable SNFS performance features. These features include preallocation, stripe alignment, and affinities. See the cvmkfile man page.
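A minimal sketch of such an invocation is shown below; the affinity key, size syntax, and path are illustrative, so check the cvmkfile man page for the exact option set.
# Preallocate a 10 GB file and steer it to the stripe group with the Video affinity
cvmkfile -k Video 10g /stornext/snfs1/video/capture001.dpx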
The lmdd utility is very useful for measuring raw LUN performance as well as the effect of varied I/O transfer sizes. It is part of the lmbench package and is available from http://sourceforge.net.
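For example, raw read throughput of a LUN can be sampled at candidate transfer sizes along these lines. The device name and counts are illustrative, and read-only tests are used because writing to a raw LUN destroys data.
# Read 1000 x 2MB transfers from the LUN, discarding the data (of=internal)
lmdd if=/dev/sdb of=internal bs=2m count=1000
# Repeat with other transfer sizes (for example, 512k, 1m, 4m) to compare
lmdd if=/dev/sdb of=internal bs=4m count=500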
The cvdbset utility has a special “Perf” trace flag that is very useful to analyze I/O performance. For example: cvdbset perf
Then, you can use cvdb -g to collect trace information such as this:
PERF: Device Write 41 MB/s IOs 2 exts 1 offs 0x0 len 0x400000 mics 95589 ino 0x5
PERF: VFS Write EofDmaAlgn 41 MB/s offs 0x0 len 0x400000 mics 95618 ino 0x5
The “PERF: Device” trace displays throughput measured for the device I/O. It also displays the number of I/Os into which it was broken, and the number of extents (sequence of consecutive filesystem blocks).
The “PERF: VFS” trace displays throughput measured for the read or write system call and significant aspects of the I/O, including:
Dma: DMA
Buf: Buffered
Eof: File extended
Algn: Well-formed DMA I/O
Shr: File is shared by another client
Rt: File is real time
Zr: Hole in file was zeroed
Both traces also report file offset, I/O size, latency (mics), and inode number.
Sample use cases:
Verify that I/O properties are as expected.
You can use the VFS trace to ensure that the displayed properties are consistent with expectations, such as being well formed; buffered versus DMA; shared/non-shared; or I/O size. If a small I/O is performed using DMA, performance will be poor. If DMA I/O is not well formed, it requires an extra data copy and may even be broken into small chunks. Zeroing holes in files has a performance impact.
Determine if metadata operations are impacting performance.
If VFS throughput is inconsistent or significantly less than Device throughput, it might be caused by metadata operations. In that case, it would be useful to display “fsmtoken,” “fsmvnops,” and “fsmdmig” traces in addition to “perf.”
Identify disk performance issues.
If Device throughput is inconsistent or less than expected, it might indicate a slow disk in a stripe group, or that RAID tuning is necessary.
Identify file fragmentation.
If the extent count “exts” is high, it might indicate a fragmentation problem. This causes the device I/Os to be broken into smaller chunks, which can significantly impact throughput.
Identify read/modify/write condition.
If buffered VFS writes are causing Device reads, it might be beneficial to match I/O request size to a multiple of the “cachebufsize” (default 64KB; see mount_cvfs man page). Another way to avoid this is by truncating the file before writing.
The cvadmin command includes a latency-test utility for measuring the latency between an FSM and one or more SNFS clients. This utility causes small messages to be exchanged between the FSM and clients as quickly as possible for a brief period of time, and reports the average time it took for each message to receive a response.
The latency-test command has the following syntax:
latency-test <index-number> [ <seconds> ]
latency-test all [ <seconds> ]
If an index-number is specified, the test is run between the currently-selected FSM and the specified client. (Client index numbers are displayed by the cvadmin who command). If all is specified, the test is run against each client in turn.
The test is run for 2 seconds, unless a value for seconds is specified.
Here is a sample run:
snadmin (lsi) > latency-test
Test started on client 1 (bigsky-node2)... latency 55us
Test started on client 2 (k4)... latency 163us
There is no rule-of-thumb for “good” or “bad” latency values. The observed latency for GbE is less than 60 microseconds. Latency can be affected by CPU load or SNFS load on either system, by unrelated Ethernet traffic, or other factors. However, for otherwise idle systems, differences in latency between different systems can indicate differences in hardware performance. (In the example above, the difference is a Gigabit Ethernet and faster CPU versus a 100BaseT Ethernet and a slower CPU.) Differences in latency over time for the same system can indicate new hardware problems, such as a network interface going bad.
If a latency test has been run for a particular client, the cvadmin who long command includes the test results in its output, along with information about when the test was last run.
Mount Command Options
The following SNFS mount command settings are explained in greater detail in the mount_cvfs man page.
The default size of the client buffer cache varies by platform and main memory size, and ranges between 32MB and 256MB. By default, each buffer is 64K, so the cache contains between 512 and 4096 buffers. In general, increasing the size of the buffer cache will not improve performance for streaming reads and writes. However, a large cache helps greatly in cases of multiple concurrent streams, and where files are being written and subsequently read. Buffer cache size is adjusted with the buffercachecap setting.
The buffer cache I/O size is adjusted using the cachebufsize setting. The default setting is usually optimal; however, sometimes performance can be improved by increasing this setting to match the RAID 5 stripe size.
Note: In prior releases of StorNext, using a large cachebufsize setting could decrease small, random I/O READ performance. However, in StorNext 5, the buffer cache has been modified to avoid this issue.
The cachebufsize parameter is a mount option and can be unique for every client that mounts the file system.
Buffer cache read-ahead can be adjusted with the buffercache_readahead setting. When the system detects that a file is being read in its entirety, several buffer cache I/O daemons pre-fetch data from the file in the background for improved performance. The default setting is optimal in most scenarios.
The auto_dma_read_length and auto_dma_write_length settings determine the minimum transfer size where direct DMA I/O is performed instead of using the buffer cache for well-formed I/O. These settings can be useful when performance degradation is observed for small DMA I/O sizes compared to buffer cache.
For example, if buffer cache I/O throughput is 200 MB/sec but 512K DMA I/O size observes only 100MB/sec, it would be useful to determine which DMA I/O size matches the buffer cache performance and adjust auto_dma_read_length and auto_dma_write_length accordingly. The lmdd utility is handy here.
The dircachesize option sets the size of the directory information cache on the client. This cache can dramatically improve the speed of readdir operations by reducing metadata network message traffic between the SNFS client and FSM. Increasing this value improves performance in scenarios where very large directories are not observing the benefit of the client directory cache.
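As an illustration, several of these options can be combined in a single /etc/fstab entry on a Linux client; the values and units shown are placeholders, so confirm them in the mount_cvfs man page.
# /etc/fstab entry for a StorNext client (values are illustrative)
snfs1  /stornext/snfs1  cvfs  rw,buffercachecap=512,cachebufsize=262144,auto_dma_read_length=1048576,auto_dma_write_length=1048576,dircachesize=33554432  0  0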
SNFS External API
The SNFS External API might be useful in some scenarios because it offers programmatic use of special SNFS performance capabilities such as affinities, preallocation, and quality of service. For more information, see the “Quality of Service” topic of the StorNext File System API Guide posted here (click the “Select a StorNext Version” menu to view the desired documents):