The Metadata Controller System
The CPU power and memory capacity of the MDC system are important performance factors, as is the number of file systems hosted per system. To ensure fast response time, use dedicated systems, limit the number of file systems hosted per system (a maximum of eight), and provide adequate CPU and memory. See StorNext Limits for limits on the number of files per file system and per database instance.
Some metadata operations, such as file creation, can be CPU intensive and benefit from increased CPU power. Other operations, such as directory traversal, can benefit greatly from increased memory. SNFS provides two configuration file settings that can be used to realize performance gains from increased memory:
BufferCacheSize
InodeCacheSize
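A minimal illustration of where these two settings appear in the file system configuration file follows; the values shown here are illustrative, and each setting is discussed in detail later in this section.
Example (Linux)
<bufferCacheSize>268435456</bufferCacheSize>
<inodeCacheSize>131072</inodeCacheSize>
Example (Windows)
BufferCacheSize 256MB
InodeCacheSize 128K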
However, it is critical that the MDC system have enough physical memory available to ensure that the FSM process does not get swapped out. Otherwise, severe performance degradation and system instability can result.
The operating system on the metadata controller must always be run in U.S. English. On Windows systems, this is done by setting the system locale to U.S. English.
Caution: Because the File System Manager (FSM) supports over 1000 clients (with more than 1000 file requests per client), the resource limits of your MDC may be exhausted with additional load from other processes. Exceeding the file descriptor limit will cause errors on your system. Quantum recommends that you do not run additional applications on the MDC.

The following FSM configuration file settings are explained in greater detail in the snfs.cfgx file and snfs_config(5) man pages, which are available in the StorNext MAN Pages Reference Guide posted at https://www.quantum.com/snsdocs; refer there for setting details. For a sample FSM configuration file, see Example FSM Configuration File.

Stripe Groups
Splitting apart data, metadata, and journal into separate stripe groups is usually the most important performance tactic. The create, remove, and allocate (for example, write) operations are very sensitive to the I/O latency of the journal stripe group. However, if create, remove, and allocate performance is not critical, it is acceptable to share one stripe group for both metadata and journal; be sure to set the exclusive property on that stripe group so it is not allocated for user data as well.
Note: It is recommended that you have only a single metadata stripe group. For increased performance, use multiple LUNs (2 or 4) for the stripe group.
RAID 1 mirroring is optimal for metadata and journal storage. Utilizing the write-back caching feature of the RAID system (as described previously) is critical to optimizing performance of the journal and metadata stripe groups. Quantum recommends mapping no more than one LUN per RAID 1 set.
Example (Linux)
<stripeGroup index="0" name="MetaFiles" status="up" stripeBreadth="262144" read="true" write="true" metadata="true" journal="false" userdata="false" realTimeIOs="200" realTimeIOsReserve="1" realTimeMB="200" realTimeMBReserve="1" realTimeTokenTimeout="0" multipathMethod="rotate">
<disk index="0" diskLabel="CvfsDisk0" diskType="MetaDrive"/>
</stripeGroup>
<stripeGroup index="1" name="JournFiles" status="up" stripeBreadth="262144" read="true" write="true" metadata="false" journal="true" userdata="false" realTimeIOs="0" realTimeIOsReserve="0" realTimeMB="0" realTimeMBReserve="0" realTimeTokenTimeout="0" multipathMethod="rotate">
<disk index="0" diskLabel="CvfsDisk1" diskType="JournalDrive"/>
</stripeGroup>
<stripeGroup index="4" name="RegularFiles" status="up" stripeBreadth="262144" read="true" write="true" metadata="false" journal="false" userdata="true" realTimeIOs="0" realTimeIOsReserve="0" realTimeMB="0" realTimeMBReserve="0" realTimeTokenTimeout="0" multipathMethod="rotate">
<disk index="0" diskLabel="CvfsDisk14" diskType="DataDrive"/>
<disk index="1" diskLabel="CvfsDisk15" diskType="DataDrive"/>
<disk index="2" diskLabel="CvfsDisk16" diskType="DataDrive"/>
<disk index="3" diskLabel="CvfsDisk17" diskType="DataDrive"/>
</stripeGroup>
Example (Windows)
[StripeGroup MetaFiles]
Status Up
StripeBreadth 256K
Metadata Yes
Journal No
Exclusive Yes
Read Enabled
Write Enabled
Rtmb 200
Rtios 200
RtmbReserve 1
RtiosReserve 1
RtTokenTimeout 0
MultiPathMethod Rotate
Node CvfsDisk0 0
[StripeGroup JournFiles]
Status Up
StripeBreadth 256K
Metadata No
Journal Yes
Exclusive Yes
Read Enabled
Write Enabled
Rtmb 0
Rtios 0
RtmbReserve 0
RtiosReserve 0
RtTokenTimeout 0
MultiPathMethod Rotate
Node CvfsDisk1 0
[StripeGroup RegularFiles]
Status Up
StripeBreadth 256K
Metadata No
Journal No
Exclusive No
Read Enabled
Write Enabled
Rtmb 0
Rtios 0
RtmbReserve 0
RtiosReserve 0
RtTokenTimeout 0
MultiPathMethod Rotate
Node CvfsDisk14 0
Node CvfsDisk15 1
Node CvfsDisk16 2
Node CvfsDisk17 3

Affinities
Affinities are another stripe group feature that can be very beneficial. Affinities can direct file allocation to appropriate stripe groups according to performance requirements. For example, stripe groups can be set up with unique hardware characteristics such as fast disk versus slow disk, or wide stripe versus narrow stripe. Affinities can then be employed to steer files to the appropriate stripe group.
For optimal performance, files that are accessed using large DMA-based I/O could be steered to wide-stripe stripe groups. Less performance-critical files could be steered to slow disk stripe groups. Small files could be steered clear of large files, or to narrow-stripe stripe groups.
Example (Linux)
<stripeGroup index="3" name="AudioFiles" status="up" stripeBreadth="1048576" read="true" write="true" metadata="false" journal="false" userdata="true" realTimeIOs="0" realTimeIOsReserve="0" realTimeMB="0" realTimeMBReserve="0" realTimeTokenTimeout="0" multipathMethod="rotate">
<affinities exclusive="true">
<affinity>Audio</affinity>
</affinities>
<disk index="0" diskLabel="CvfsDisk10" diskType="AudioDrive"/>
<disk index="1" diskLabel="CvfsDisk11" diskType="AudioDrive"/>
<disk index="2" diskLabel="CvfsDisk12" diskType="AudioDrive"/>
<disk index="3" diskLabel="CvfsDisk13" diskType="AudioDrive"/>
</stripeGroup>
Example (Windows)
[StripeGroup AudioFiles]
Status Up
StripeBreadth 1M
Metadata No
Journal No
Exclusive Yes
Read Enabled
Write Enabled
Rtmb 0
Rtios 0
RtmbReserve 0
RtiosReserve 0
RtTokenTimeout 0
MultiPathMethod Rotate
Node CvfsDisk10 0
Node CvfsDisk11 1
Node CvfsDisk12 2
Node CvfsDisk13 3
Affinity Audio
Note: Affinity names cannot be longer than eight characters.
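As an illustration of steering a file to this stripe group, a client could pre-allocate a file with the Audio affinity key using the cvmkfile utility. This is a sketch with a hypothetical path and size; see cvmkfile in the StorNext MAN Pages Reference Guide for exact syntax.
cvmkfile -k Audio 10g /stornext/snfs1/audio/capture.aif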

StripeBreadth
This setting should match the RAID stripe size or be a multiple of the RAID stripe size. Matching the RAID stripe size is usually the optimal setting. However, depending on the RAID performance characteristics and application I/O size, it might be beneficial to use a multiple or integer fraction of the RAID stripe size. For example, if the RAID stripe size is 256K, the stripe group contains 4 LUNs, and the application to be optimized uses DMA I/O with an 8MB block size, a StripeBreadth setting of 2MB might be optimal. In this example, the 8MB application I/O is issued as four concurrent 2MB I/Os to the RAID. This concurrency can provide up to a 4X performance increase. Finding the optimal StripeBreadth typically requires some experimentation to determine the RAID characteristics; the lmdd utility can be very helpful. Note that this setting is not adjustable after initial file system creation.
The optimal range for the StripeBreadth setting is 128K to multiple megabytes, but this varies widely.
Note: This setting cannot be changed after being put into production, so it is important to choose the setting carefully during initial configuration.
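In the configuration file, the 2MB StripeBreadth from the example above would be written as shown below; the fragment is illustrative, and in the XML form stripeBreadth is expressed in bytes. (The complete examples that follow use a 4MB value.)
Example (Linux)
stripeBreadth="2097152"
Example (Windows)
StripeBreadth 2M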
Example (Linux)
<stripeGroup index="2" name="VideoFiles" status="up" stripeBreadth="4194304" read="true" write="true" metadata="false" journal="false" userdata="true" realTimeIOs="0" realTimeIOsReserve="0" realTimeMB="0" realTimeMBReserve="0" realTimeTokenTimeout="0" multipathMethod="rotate">
<affinities exclusive="true">
<affinity>Video</affinity>
</affinities>
<disk index="0" diskLabel="CvfsDisk2" diskType="VideoDrive"/>
<disk index="1" diskLabel="CvfsDisk3" diskType="VideoDrive"/>
<disk index="2" diskLabel="CvfsDisk4" diskType="VideoDrive"/>
<disk index="3" diskLabel="CvfsDisk5" diskType="VideoDrive"/>
<disk index="4" diskLabel="CvfsDisk6" diskType="VideoDrive"/>
<disk index="5" diskLabel="CvfsDisk7" diskType="VideoDrive"/>
<disk index="6" diskLabel="CvfsDisk8" diskType="VideoDrive"/>
<disk index="7" diskLabel="CvfsDisk9" diskType="VideoDrive"/>
</stripeGroup>
Example (Windows)
[StripeGroup VideoFiles]
Status Up
StripeBreadth 4M
Metadata No
Journal No
Exclusive Yes
Read Enabled
Write Enabled
Rtmb 0
Rtios 0
RtmbReserve 0
RtiosReserve 0
RtTokenTimeout 0
MultiPathMethod Rotate
Node CvfsDisk2 0
Node CvfsDisk3 1
Node CvfsDisk4 2
Node CvfsDisk5 3
Node CvfsDisk6 4
Node CvfsDisk7 5
Node CvfsDisk8 6
Node CvfsDisk9 7
Affinity Video

BufferCacheSize
Increasing this value can reduce the latency of any metadata operation by performing a hot cache access to directory blocks, inode information, and other metadata, which is about 10 to 1000 times faster than I/O. It is especially important to increase this setting if metadata I/O latency is high (for example, more than 2ms average latency). Quantum recommends sizing this according to how much memory is available; more is better. Optimal settings for BufferCacheSize range from 32MB to 8GB for a new file system and can be increased up to 500GB as a file system grows. A higher setting is more effective if the CPU is not heavily loaded.
When the value of BufferCacheSize is greater than 1GB, SNFS uses compressed buffers to maximize the amount of cached metadata. The effective size of the cache is as follows:
If BufferCacheSize is less than or equal to 1GB, then:
Effective Cache Size = BufferCacheSize
If BufferCacheSize is greater than 1GB, then:
Effective Cache Size = (BufferCacheSize - 512MB) * 2.5
The value 2.5 in the above formula represents a typical level of compression. This factor may be somewhat lower or higher, depending on the complexity of the file system metadata.
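A worked example, assuming the typical 2.5 compression factor: with BufferCacheSize set to 4GB,
Effective Cache Size = (4096MB - 512MB) * 2.5 = 8960MB
or roughly 8.75GB of effective metadata cache.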
Note: Configuring a large value of BufferCacheSize will increase the memory footprint of the FSM process. If this process crashes, a core file will be generated that will consume disk space proportional to its size.
Example (Linux)
<bufferCacheSize>268435456</bufferCacheSize>
Example (Windows)
BufferCacheSize 256MB
In StorNext, the default value for the BufferCacheSize parameter in the file system configuration file changed from 32 MB to 256 MB. While uncommon, if a file system configuration file is missing this parameter, the new value will be in effect. This may improve performance; however, the FSM process will use more memory than it did with previous releases (up to 400 MB). To avoid the increased memory utilization, the BufferCacheSize parameter may be added to the file system configuration file with the old default value.
Example (Linux)
<bufferCacheSize>33554432</bufferCacheSize>
Example (Windows)
BufferCacheSize 32M

InodeCacheSize
This setting consumes about 1400 bytes of memory times the number specified. Increasing this value can reduce the latency of any metadata operation by performing a hot cache access to inode information instead of an I/O to get inode information from disk, which is about 100 to 1000 times faster. It is especially important to increase this setting if metadata I/O latency is high (for example, more than 2ms average latency). You should try to size this according to the combined working set of files for all clients. Optimal settings for InodeCacheSize range from 16K to 128K for a new file system and can be increased to 256K or 512K as a file system grows. A higher setting is more effective if the CPU is not heavily loaded. For best performance, the InodeCacheSize should be at least 1024 times the number of megabytes allocated to the journal. For example, for a 64MB journal, the InodeCacheSize should be at least 64K.
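As a worked example of the memory cost, an InodeCacheSize of 128K (131072) consumes approximately:
131072 * 1400 bytes = 183,500,800 bytes (about 175MB)
of FSM process memory.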
Example (Linux)
<inodeCacheSize>131072</inodeCacheSize>
Example (Windows)
InodeCacheSize 128K
In StorNext, the default value for the InodeCacheSize parameter in the file system configuration file changed from 32768 to 131072. While uncommon, if a file system configuration file is missing this parameter, the new value will be in effect. This may improve performance; however, the FSM process will use more memory than it did with previous releases (up to 400 MB). To avoid the increased memory utilization, the InodeCacheSize parameter may be added to the file system configuration file with the old default value.
Example (Linux)
<inodeCacheSize>32768</inodeCacheSize>
Example (Windows)
InodeCacheSize 32K

FsBlockSize
Beginning with StorNext 5, all SNFS file systems use a File System Block Size of 4KB. This is the optimal value and is no longer tunable. Any file systems created with pre-5 versions of StorNext having larger File System Block Sizes are automatically converted to use 4KB the first time the file system is started with StorNext 5.

JournalSize
Quantum recommends that you set the JournalSize to 256 megabytes (MB).
Increasing the JournalSize beyond 256 MB may be beneficial for workloads where many large directories are being created or removed at the same time. For example, workloads dealing with 100 thousand files in a directory, and several directories at once, experience improved throughput with a larger journal.
The downside of a larger journal size is potentially longer FSM startup and failover times. If you use a value less than 256 MB, your failover time might be improved, but your file system performance might be reduced. Quantum recommends that you do not set the JournalSize to a value less than 16 MB.
Note: Journal replay has been optimized, so a 256 MB journal often replays significantly faster than a 16 MB journal.
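Expressed in the configuration file conventions used throughout this section, the recommended 256 MB journal would look like the following. This is a sketch; journalSize is a global setting described in snfs_config(5), and the XML form is expressed in bytes.
Example (Linux)
<journalSize>268435456</journalSize>
Example (Windows)
JournalSize 256M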
A file system created with a release prior to StorNext 5 might have been configured with a small JournalSize. This is true for file systems created on Windows MDCs, where the old default journal size was 4 MB. Journals of this size continue to function with StorNext 5.x, but experience a performance benefit if the size is increased to 256 MB. You can adjust the setting by using the cvupdatefs utility. For more information, see the command cvupdatefs in the StorNext MAN Pages Reference Guide.
If a file system previously had been configured with a JournalSize larger than 256 MB, there is no reason to reduce it to 256 MB when upgrading to a current release of StorNext.