The Metadata Controller System
The CPU power and memory capacity of the MDC system are important performance factors, as is the number of file systems hosted per system. To ensure fast response time, use dedicated systems, limit the number of file systems hosted per system (a maximum of eight), and provide adequate CPU and memory. See StorNext Limits for limits on the number of files per file system and per database instance.
Some metadata operations, such as file creation, can be CPU intensive and benefit from increased CPU power. Other operations, such as directory traversal, can benefit greatly from increased memory. SNFS provides two configuration file settings that can be used to realize performance gains from increased memory:
BufferCacheSize
InodeCacheSize
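A minimal illustration of where these two settings appear in the file system configuration file follows; the values shown here are illustrative, and each setting is discussed in detail later in this section.
Example (Linux)
<bufferCacheSize>268435456</bufferCacheSize>
<inodeCacheSize>131072</inodeCacheSize>
Example (Windows)
BufferCacheSize 256MB
InodeCacheSize 128K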
However, it is critical that the MDC system have enough physical memory available to ensure that the FSM process does not get swapped out. Otherwise, severe performance degradation and system instability can result.
The operating system on the metadata controller must always be run in U.S. English. On Windows systems, this is done by setting the system locale to U.S. English.
Caution: Because the File System Manager (FSM) supports over 1000 clients (with more than 1000 file requests per client), the resource limits of your MDC may be exhausted with additional load from other processes. Exceeding the file descriptor limit will cause errors on your system. Quantum recommends that you do not run additional applications on the MDC.

The following FSM configuration file settings are explained in greater detail in the snfs.cfgx file and snfs_config(5) man pages, which are available in the StorNext MAN Pages Reference Guide posted at https://www.quantum.com/snsdocs; refer there for setting details. For a sample FSM configuration file, see Example FSM Configuration File.

Stripe Groups
Splitting apart data, metadata, and journal into separate stripe groups is usually the most important performance tactic. The create, remove, and allocate (for example, write) operations are very sensitive to the I/O latency of the journal stripe group. However, if create, remove, and allocate performance is not critical, it is acceptable to share one stripe group for both metadata and journal; be sure to set the exclusive property on that stripe group so it is not allocated for user data as well.
Note: It is recommended that you have only a single metadata stripe group. For increased performance, use multiple LUNs (2 or 4) for the stripe group.
RAID 1 mirroring is optimal for metadata and journal storage. Utilizing the write-back caching feature of the RAID system (as described previously) is critical to optimizing performance of the journal and metadata stripe groups. Quantum recommends mapping no more than one LUN per RAID 1 set.
Example (Linux)
<stripeGroup index="0" name="MetaFiles" status="up" stripeBreadth="262144" read="true" write="true" metadata="true" journal="false" userdata="false" realTimeIOs="200" realTimeIOsReserve="1" realTimeMB="200" realTimeMBReserve="1" realTimeTokenTimeout="0" multipathMethod="rotate">
<disk index="0" diskLabel="CvfsDisk0" diskType="MetaDrive"/>
</stripeGroup>
<stripeGroup index="1" name="JournFiles" status="up" stripeBreadth="262144" read="true" write="true" metadata="false" journal="true" userdata="false" realTimeIOs="0" realTimeIOsReserve="0" realTimeMB="0" realTimeMBReserve="0" realTimeTokenTimeout="0" multipathMethod="rotate">
<disk index="0" diskLabel="CvfsDisk1" diskType="JournalDrive"/>
</stripeGroup>
<stripeGroup index="4" name="RegularFiles" status="up" stripeBreadth="262144" read="true" write="true" metadata="false" journal="false" userdata="true" realTimeIOs="0" realTimeIOsReserve="0" realTimeMB="0" realTimeMBReserve="0" realTimeTokenTimeout="0" multipathMethod="rotate">
<disk index="0" diskLabel="CvfsDisk14" diskType="DataDrive"/>
<disk index="1" diskLabel="CvfsDisk15" diskType="DataDrive"/>
<disk index="2" diskLabel="CvfsDisk16" diskType="DataDrive"/>
<disk index="3" diskLabel="CvfsDisk17" diskType="DataDrive"/>
</stripeGroup>
Example (Windows)
[StripeGroup MetaFiles]
Status Up
StripeBreadth 256K
Metadata Yes
Journal No
Exclusive Yes
Read Enabled
Write Enabled
Rtmb 200
Rtios 200
RtmbReserve 1
RtiosReserve 1
RtTokenTimeout 0
MultiPathMethod Rotate
Node CvfsDisk0 0
[StripeGroup JournFiles]
Status Up
StripeBreadth 256K
Metadata No
Journal Yes
Exclusive Yes
Read Enabled
Write Enabled
Rtmb 0
Rtios 0
RtmbReserve 0
RtiosReserve 0
RtTokenTimeout 0
MultiPathMethod Rotate
Node CvfsDisk1 0
[StripeGroup RegularFiles]
Status Up
StripeBreadth 256K
Metadata No
Journal No
Exclusive No
Read Enabled
Write Enabled
Rtmb 0
Rtios 0
RtmbReserve 0
RtiosReserve 0
RtTokenTimeout 0
MultiPathMethod Rotate
Node CvfsDisk14 0
Node CvfsDisk15 1
Node CvfsDisk16 2
Node CvfsDisk17 3

Affinities
Affinities are another stripe group feature that can be very beneficial. Affinities can direct file allocation to appropriate stripe groups according to performance requirements. For example, stripe groups can be set up with unique hardware characteristics such as fast disk versus slow disk, or wide stripe versus narrow stripe. Affinities can then be employed to steer files to the appropriate stripe group.
For optimal performance, files that are accessed using large DMA-based I/O could be steered to wide-stripe stripe groups. Less performance-critical files could be steered to slow disk stripe groups. Small files could be steered clear of large files, or to narrow-stripe stripe groups.
Example (Linux)
<stripeGroup index="3" name="AudioFiles" status="up" stripeBreadth="1048576" read="true" write="true" metadata="false" journal="false" userdata="true" realTimeIOs="0" realTimeIOsReserve="0" realTimeMB="0" realTimeMBReserve="0" realTimeTokenTimeout="0" multipathMethod="rotate">
<affinities exclusive="true">
<affinity>Audio</affinity>
</affinities>
<disk index="0" diskLabel="CvfsDisk10" diskType="AudioDrive"/>
<disk index="1" diskLabel="CvfsDisk11" diskType="AudioDrive"/>
<disk index="2" diskLabel="CvfsDisk12" diskType="AudioDrive"/>
<disk index="3" diskLabel="CvfsDisk13" diskType="AudioDrive"/>
</stripeGroup>
Example (Windows)
[StripeGroup AudioFiles]
Status Up
StripeBreadth 1M
Metadata No
Journal No
Exclusive Yes
Read Enabled
Write Enabled
Rtmb 0
Rtios 0
RtmbReserve 0
RtiosReserve 0
RtTokenTimeout 0
MultiPathMethod Rotate
Node CvfsDisk10 0
Node CvfsDisk11 1
Node CvfsDisk12 2
Node CvfsDisk13 3
Affinity Audio
Note: Affinity names cannot be longer than eight characters.
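As an illustration of steering a file to this stripe group, a client could pre-allocate a file with the Audio affinity key using the cvmkfile utility. This is a sketch with a hypothetical path and size; see cvmkfile in the StorNext MAN Pages Reference Guide for exact syntax.
cvmkfile -k Audio 10g /stornext/snfs1/audio/capture.aif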

StripeBreadth
This setting should match the RAID stripe size or be a multiple of the RAID stripe size. Matching the RAID stripe size is usually the optimal setting. However, depending on the RAID performance characteristics and application I/O size, it might be beneficial to use a multiple or integer fraction of the RAID stripe size. For example, if the RAID stripe size is 256K, the stripe group contains 4 LUNs, and the application to be optimized uses DMA I/O with an 8MB block size, a StripeBreadth setting of 2MB might be optimal. In this example, the 8MB application I/O is issued as four concurrent 2MB I/Os to the RAID. This concurrency can provide up to a 4X performance increase. Finding the optimal StripeBreadth typically requires some experimentation to determine the RAID characteristics; the lmdd utility can be very helpful. Note that this setting is not adjustable after initial file system creation.
The optimal range for the StripeBreadth setting is 128K to multiple megabytes, but this varies widely.
Note: This setting cannot be changed after being put into production, so it is important to choose the setting carefully during initial configuration.
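In the configuration file, the 2MB StripeBreadth from the example above would be written as shown below; the fragment is illustrative, and in the XML form stripeBreadth is expressed in bytes. (The complete examples that follow use a 4MB value.)
Example (Linux)
stripeBreadth="2097152"
Example (Windows)
StripeBreadth 2M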
Example (Linux)
<stripeGroup index="2" name="VideoFiles" status="up" stripeBreadth="4194304" read="true" write="true" metadata="false" journal="false" userdata="true" realTimeIOs="0" realTimeIOsReserve="0" realTimeMB="0" realTimeMBReserve="0" realTimeTokenTimeout="0" multipathMethod="rotate">
<affinities exclusive="true">
<affinity>Video</affinity>
</affinities>
<disk index="0" diskLabel="CvfsDisk2" diskType="VideoDrive"/>
<disk index="1" diskLabel="CvfsDisk3" diskType="VideoDrive"/>
<disk index="2" diskLabel="CvfsDisk4" diskType="VideoDrive"/>
<disk index="3" diskLabel="CvfsDisk5" diskType="VideoDrive"/>
<disk index="4" diskLabel="CvfsDisk6" diskType="VideoDrive"/>
<disk index="5" diskLabel="CvfsDisk7" diskType="VideoDrive"/>
<disk index="6" diskLabel="CvfsDisk8" diskType="VideoDrive"/>
<disk index="7" diskLabel="CvfsDisk9" diskType="VideoDrive"/>
</stripeGroup>
Example (Windows)
[StripeGroup VideoFiles]
Status Up
StripeBreadth 4M
Metadata No
Journal No
Exclusive Yes
Read Enabled
Write Enabled
Rtmb 0
Rtios 0
RtmbReserve 0
RtiosReserve 0
RtTokenTimeout 0
MultiPathMethod Rotate
Node CvfsDisk2 0
Node CvfsDisk3 1
Node CvfsDisk4 2
Node CvfsDisk5 3
Node CvfsDisk6 4
Node CvfsDisk7 5
Node CvfsDisk8 6
Node CvfsDisk9 7
Affinity Video

BufferCacheSize
Increasing this value can reduce the latency of any metadata operation by performing a hot cache access to directory blocks, inode information, and other metadata, which is about 10 to 1000 times faster than I/O. It is especially important to increase this setting if metadata I/O latency is high (for example, more than 2ms average latency). Quantum recommends sizing this according to how much memory is available; more is better. Optimal settings for BufferCacheSize range from 32MB to 8GB for a new file system and can be increased up to 500GB as a file system grows. A higher setting is more effective if the CPU is not heavily loaded.
When the value of BufferCacheSize is greater than 1GB, SNFS uses compressed buffers to maximize the amount of cached metadata. The effective size of the cache is as follows:
If BufferCacheSize is less than or equal to 1GB, then:
Effective Cache Size = BufferCacheSize
If BufferCacheSize is greater than 1GB, then:
Effective Cache Size = (BufferCacheSize - 512MB) * 2.5
The value 2.5 in the above formula represents a typical level of compression. This factor may be somewhat lower or higher, depending on the complexity of the file system metadata.
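A worked example, assuming the typical 2.5 compression factor: with BufferCacheSize set to 4GB,
Effective Cache Size = (4096MB - 512MB) * 2.5 = 8960MB
or roughly 8.75GB of effective metadata cache.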
Note: Configuring a large value of BufferCacheSize will increase the memory footprint of the FSM process. If this process crashes, a core file will be generated that will consume disk space proportional to its size.
Example (Linux)
<bufferCacheSize>268435456</bufferCacheSize>
Example (Windows)
BufferCacheSize 256MB
In StorNext, the default value for the BufferCacheSize parameter in the file system configuration file changed from 32 MB to 256 MB. While uncommon, if a file system configuration file is missing this parameter, the new value will be in effect. This may improve performance; however, the FSM process will use more memory than it did with previous releases (up to 400 MB). To avoid the increased memory utilization, the BufferCacheSize parameter may be added to the file system configuration file with the old default value.
Example (Linux)
<bufferCacheSize>33554432</bufferCacheSize>
Example (Windows)
BufferCacheSize 32M

InodeCacheSize
This setting consumes about 1400 bytes of memory times the number specified. Increasing this value can reduce the latency of any metadata operation by performing a hot cache access to inode information instead of an I/O to get inode information from disk, which is about 100 to 1000 times faster. It is especially important to increase this setting if metadata I/O latency is high (for example, more than 2ms average latency). You should try to size this according to the combined working set of files for all clients. Optimal settings for InodeCacheSize range from 16K to 128K for a new file system and can be increased to 256K or 512K as a file system grows. A higher setting is more effective if the CPU is not heavily loaded. For best performance, the InodeCacheSize should be at least 1024 times the number of megabytes allocated to the journal. For example, for a 64MB journal, the InodeCacheSize should be at least 64K.
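As a worked example of the memory cost, an InodeCacheSize of 128K (131072) consumes approximately:
131072 * 1400 bytes = 183,500,800 bytes (about 175MB)
of FSM process memory.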
Example (Linux)
<inodeCacheSize>131072</inodeCacheSize>
Example (Windows)
InodeCacheSize 128K
In StorNext, the default value for the InodeCacheSize parameter in the file system configuration file changed from 32768 to 131072. While uncommon, if a file system configuration file is missing this parameter, the new value will be in effect. This may improve performance; however, the FSM process will use more memory than it did with previous releases (up to 400 MB). To avoid the increased memory utilization, the InodeCacheSize parameter may be added to the file system configuration file with the old default value.
Example (Linux)
<inodeCacheSize>32768</inodeCacheSize>
Example (Windows)
InodeCacheSize 32K

FsBlockSize
Beginning with StorNext 5, all SNFS file systems use a File System Block Size of 4KB. This is the optimal value and is no longer tunable. Any file systems created with pre-5 versions of StorNext having larger File System Block Sizes are automatically converted to use 4KB the first time the file system is started with StorNext 5.

JournalSize
Quantum recommends that you set the JournalSize to 256 megabytes (MB).
Increasing the JournalSize beyond 256 MB may be beneficial for workloads where many large directories are being created or removed at the same time. For example, workloads dealing with 100 thousand files in a directory, and several directories at once, experience improved throughput with a larger journal.
The downside of a larger journal size is potentially longer FSM startup and failover times. If you use a value less than 256 MB, your failover time might be improved, but your file system performance might be reduced. Quantum recommends that you do not set the JournalSize to a value less than 16 MB.
Note: Journal replay has been optimized, so a 256 MB journal often replays significantly faster than a 16 MB journal.
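Expressed in the configuration file conventions used throughout this section, the recommended 256 MB journal would look like the following. This is a sketch; journalSize is a global setting described in snfs_config(5), and the XML form is expressed in bytes.
Example (Linux)
<journalSize>268435456</journalSize>
Example (Windows)
JournalSize 256M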
A file system created with a release prior to StorNext 5 might have been configured with a small JournalSize. This is true for file systems created on Windows MDCs, where the old default journal size was 4 MB. Journals of this size continue to function with StorNext 5.x, but experience a performance benefit if the size is increased to 256 MB. You can adjust the setting by using the cvupdatefs utility. For more information, see the command cvupdatefs in the StorNext MAN Pages Reference Guide.
If a file system previously had been configured with a JournalSize larger than 256 MB, there is no reason to reduce it to 256 MB when upgrading to a current release of StorNext.