The Underlying Storage System

The performance characteristics of the underlying storage system are the most critical factors for file system performance. Which characteristics matter most depends on an environment's use cases. For very large, sequential file access, storage throughput is the most important factor. For smaller or random file access, storage access time and I/O latency are the most important factors.

Metadata access consists of small, random I/O, with most I/O 4KB in size. As such, storage access time and I/O latency are the key factors when tuning for StorNext file operation performance.

Solid state drives (SSDs) have shown advantages when optimizing storage for metadata performance due to their very low I/O latency and high rate of operations per second. Choosing SSDs within RAID storage provides a good mix of resiliency and small, random I/O performance.

Typically, RAID storage systems provide many tuning options for cache settings, RAID level, segment size, stripe size, and so on.

RAID Cache Configuration

The single most important RAID tuning component is the cache configuration. This is particularly true for small I/O operations. Contemporary RAID systems provide excellent small I/O performance with properly tuned caching. So, for the best general purpose performance characteristics, it is crucial to utilize the RAID system caching as fully as possible.

For example, write-back caching is absolutely essential for metadata stripe groups to achieve high metadata operations throughput.

However, there are a few drawbacks to consider as well. For example, read-ahead caching improves sequential read performance but might reduce random performance. Write-back caching is critical for small write performance but may limit peak large I/O throughput.

Caution: Some RAID systems cannot safely support write-back caching without risk of data loss, which is not suitable for critical data such as file system metadata.

Consequently, this is an area that requires an understanding of application I/O requirements. As a general rule, RAID system caching is critically important for most applications, so it is the first place to focus tuning attention.

RAID Write-Back Caching

Write-back caching dramatically reduces the latency of small write operations. The RAID acknowledges completion of a write I/O as soon as the data has been captured into the RAID's cache, without waiting for the data to reach disk. While new writes are being captured into cache, the RAID writes previously cached data onto the targeted disk LUN storage. The result is minimal I/O latency and thus a great performance improvement for small write I/O operations.

Many contemporary RAID systems protect against write-back cache data loss due to power or component failure. This is accomplished through various techniques including redundancy, battery backup, battery-backed memory, and controller mirroring. To prevent data corruption, it is important to ensure that these systems are working properly. It is particularly catastrophic if file system metadata is corrupted, because complete file system loss could result.

Caution: If the array uses write-back caching, Quantum requires that the cache is battery-backed.

Minimal I/O latency is critically important to file system performance whenever the file system processes a large number of smaller files. Each file processed requires a small metadata write operation, and, as discussed above, the latency of small write operations is greatly reduced with RAID write-back caching enabled. This is easily observed in the hourly File System Manager (FSM) statistics reports in qustats log files: the “PIO Write HiPri” statistic reports average, minimum, and maximum write latency (in microseconds) for the reporting period. If the observed average latency exceeds 0.5 milliseconds, peak metadata operation throughput will be degraded. For example, create operations may run at around 2000 per second when metadata disk latency is below 0.5 milliseconds, but may fall to fewer than 200 per second when metadata disk latency is around 5 milliseconds.
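As a rough illustration of why latency dominates (a simplified model, not a StorNext measurement, assuming each create is gated by a single metadata write), the peak operation rate is approximately the reciprocal of the per-write latency:

echo "1000 / 0.5" | bc    # roughly 2000 creates per second at 0.5 ms write latency
echo "1000 / 5" | bc      # roughly 200 creates per second at 5 ms write latency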

In contrast to Write-Back caching, Write-Through caching eliminates use of the cache for writes. This approach involves synchronous writes to the physical disk before returning a successful reply for the I/O operation. The write-through approach exhibits much worse latency than write-back caching; therefore, small I/O performance (such as metadata operations) is severely impacted. It is important to determine which write caching approach is employed, because the performance observed will differ greatly for small write I/O operations.
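One rough way to observe which behavior is in effect from a Linux host is to time small synchronous writes against a test file on the LUN in question (a sketch only; the target path is a placeholder, and file system overhead is included in the measurement):

# Time 1000 synchronous 4KB writes; elapsed time divided by 1000 is the average write latency.
dd if=/dev/zero of=/stornext/snfs1/latency.tmp bs=4k count=1000 oflag=dsync
rm -f /stornext/snfs1/latency.tmp

With effective write-back caching the average latency is typically well under 1 millisecond; with write-through behavior it is typically several milliseconds per write.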

In most cases, enabling Write-Back RAID caching improves file system performance regardless of whether small or large file sizes are being processed. However, in rare instances for some customers, depending on the type of data and RAID equipment and when larger file sizes are being processed, disabling RAID caching maximizes SNFS file system performance.

Most dual-controller disk arrays use a "write cache mirroring" mechanism to protect against a controller failure. Write cache mirroring is important to ensure data integrity when failover is enabled; however, it typically carries a performance impact that varies greatly depending on the I/O workload. Depending on their performance and reliability needs, some customers disable write cache mirroring in the array controller's cache settings; doing so can subject the array to both single points of failure and data corruption. Quantum's best practice is to enable write cache mirroring. For LUNs containing metadata, write cache mirroring must always be enabled.

Kinds of Stripe Groups

StorNext uses Stripe Groups to separate data with different characteristics onto different LUNs. Every StorNext file system has three kinds of Stripe Groups.

Metadata Stripe Groups hold the file system metadata: the file name and attributes for every file in the file system. Metadata is typically very small and accessed in a random pattern.

Journal Stripe Groups hold the StorNext Journal: the sequential changes to the file system metadata. Journal data is typically a series of small sequential writes and reads.

User Data Stripe Groups hold the content of files. User data access patterns depend heavily on the customer workflow, but typical StorNext use is of large files sequentially read and written. Users can define multiple User Data Stripe Groups with different characteristics and assign data to those Stripe Groups with Affinities; see StorNext File System Stripe Group Affinity.

Because the typical access patterns for Metadata and User Data are different, Quantum recommends creating different Stripe Groups for Metadata and User Data. Journal data access patterns are similar enough to be placed on the Metadata Stripe Group, or Journal can be placed on its own Stripe Group.

RAID Level

Configuration settings such as RAID level, segment size, and stripe size are very important and cannot be changed after the file system is put into production, so it is critical to determine appropriate settings during initial configuration.

Quantum recommends that Metadata and Journal Stripe Groups use RAID 1 because it is optimal for very small I/O sizes. Quantum also recommends using Fibre Channel or SAS disks (as opposed to SATA) for metadata and journal due to their higher IOPS performance and reliability. It is also very important to allocate entire physical disks for the Metadata and Journal LUNs in order to avoid bandwidth contention with other I/O traffic. Metadata and Journal storage requires very high IOPS rates (low latency) for optimal performance, so contention can severely impact IOPS (and latency) and thus overall performance. If Journal I/O exceeds 1 ms average latency, you will observe significant performance degradation.

Note: For Metadata, RAID 1 works well, but RAID 10 (a stripe of mirrors) offers advantages. If IOPS is the primary need of the file system, RAID 10 supports additional performance by adding additional mirror pairs to the stripe. (The minimum is 4 disks, but 6 or 8 are possible). While RAID 1 has the performance of one drive (or slightly better than one drive), RAID 10 offers the performance of RAID 0 and the security of RAID 1. This suits the small and highly random nature of metadata.

Quantum recommends that User Data Stripe Groups use RAID 5 for high throughput with resilience in case of disk error. A 4+1 RAID 5 group logically stores data on four disks and uses a fifth disk for parity.

Some storage vendors now provide RAID 6 capability for improved reliability over RAID 5. This may be particularly valuable for SATA disks where bit error rates can lead to disk problems. However, RAID 6 typically incurs a performance penalty compared to RAID 5, particularly for writes. Check with your storage vendor for RAID 5 versus RAID 6 recommendations.

Segment Size and Stripe Size

The stripe size is the sum of the segment sizes of the data disks in the RAID group. For example, a 4+1 RAID 5 group (four data disks plus one parity) with 64kB segment sizes creates a stripe group with a 256kB stripe size. The stripe size is a critical factor for write performance. Writes smaller than the stripe size incur the read/modify/write penalty, described more fully below. Quantum recommends a stripe size of 512kB or smaller.
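As a quick arithmetic check (the disk counts below are examples only), the stripe size is simply the number of data disks multiplied by the segment size:

# Stripe size = number of data disks x segment size (kB).
echo "4+1 RAID 5, 64kB segments: $((4 * 64))kB stripe"
echo "8+1 RAID 5, 64kB segments: $((8 * 64))kB stripe (at the 512kB recommendation)"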

The RAID stripe size configuration should typically match the SNFS StripeBreadth configuration setting when multiple LUNs are utilized in a stripe group. However, in some cases it might be optimal to configure the SNFS StripeBreadth as a multiple of the RAID stripe size, such as when the RAID stripe size is small but the user's I/O sizes are very large. Note that this will be suboptimal for small I/O performance, so it may not be suitable for general purpose usage.

To help visualize the read/modify/write penalty, it may be helpful to understand that the RAID can only write data onto the disks in full stripe-sized units. A write to the RAID that is not an exact multiple of the stripe size requires that the last (or only) affected stripe first be read from the disks. The write data is then merged into the stripe that was read, and finally the full stripe is written back out onto the RAID disks. When RAID write-back caching has been disabled, these read/modify/write operations require a read of the stripe data segment into memory before the data can be properly merged and written back out. This is the worst case scenario from a performance standpoint, and the read/modify/write penalty is most noticeable in the absence of write-back caching being performed by the RAID controller.
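The penalty can also be observed empirically by comparing stripe-aligned and partial-stripe writes against a test file on the affected stripe group. The sketch below assumes the 256kB stripe from the example above and a placeholder mount point; with write-back caching disabled, the second run typically shows markedly lower throughput:

# Full-stripe writes: each 256kB request maps onto one complete 4 x 64kB stripe.
dd if=/dev/zero of=/stornext/snfs1/rmw.tmp bs=256k count=4096 oflag=direct

# Partial-stripe writes: each 64kB request touches only part of a stripe,
# triggering read/modify/write when the RAID cannot coalesce the writes in cache.
dd if=/dev/zero of=/stornext/snfs1/rmw.tmp bs=64k count=16384 oflag=direct

rm -f /stornext/snfs1/rmw.tmp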

It can be useful to use a tool such as lmdd to help determine the storage system performance characteristics and choose optimal settings. For example, varying the stripe size and running lmdd with a range of I/O sizes might be useful to determine an optimal stripe size multiple to configure the SNFS StripeBreadth.
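For example, a sweep such as the following (a sketch only; the flags shown are from the lmbench version of lmdd, and the target path is a placeholder) writes 2GB at several I/O sizes so the throughput reported by lmdd can be compared as the RAID stripe size and StripeBreadth settings are varied:

# lmdd reports the amount of data moved, elapsed time, and MB/sec for each run.
for bs in 64k 256k 512k 1m 4m; do
    echo "I/O size: $bs"
    lmdd if=internal of=/stornext/snfs1/lmdd.tmp bs=$bs move=2048m fsync=1
    rm -f /stornext/snfs1/lmdd.tmp
done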

The deviceparams File

This file is used to control the I/O scheduler and the scheduler's queue depth.

For more information about this file, see the deviceparams man page, or the StorNext Man Pages Reference Guide posted here (click the “Select a StorNext Version” menu to view the desired documents):

http://www.quantum.com/sn5docs

The I/O throughput of Linux kernel 2.6.10 and later (SLES 10 and later, RHEL 5 and later) can be increased by adjusting the default I/O settings.

Note: SLES 10 is not supported in StorNext 5.

Beginning with the 2.6 kernel, the Linux I/O scheduler can be changed to control how the kernel does reads and writes. There are four types of I/O scheduler available in most versions of Linux kernel 2.6.10 and higher:

The completely fair queuing scheduler (CFQ)

The no operation scheduler (NOOP)

The deadline scheduler (DEADLINE)

The anticipatory scheduler (ANTICIPATORY)

Note: ANTICIPATORY is not present in SLES 11 SP2.

The default scheduler in most distributions is the completely fair queuing (CFQ) scheduler. Experimentation shows that the deadline scheduler provides the best improvement.
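The scheduler for a given block device can be viewed and changed at run time through sysfs, as shown below (the device name sdb is a placeholder, and changes made this way do not persist across reboots; for StorNext devices, the deviceparams file described above is used to control this setting):

cat /sys/block/sdb/queue/scheduler           # the entry in brackets is the active scheduler
echo deadline > /sys/block/sdb/queue/scheduler
cat /sys/block/sdb/queue/scheduler           # should now show [deadline]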

Increasing the number of outstanding requests has been shown to provide a performance benefit:

nr_requests=4096
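The request queue depth can be applied the same way through sysfs for each LUN (again, sdb is a placeholder; the deviceparams file is the StorNext mechanism for these settings):

echo 4096 > /sys/block/sdb/queue/nr_requests
cat /sys/block/sdb/queue/nr_requests         # confirm the new queue depth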

In addition, there are three Linux kernel parameters that can be tuned for increased performance:

1. The minimal preemption granularity variable for CPU-bound tasks.

kernel.sched_min_granularity_ns = 10ms

echo 10000000 > /proc/sys/kernel/sched_min_granularity_ns

2. The wake-up preemption granularity variable. Increasing this variable reduces wake-up preemption, reducing disturbance of compute-bound tasks. Lowering it improves wake-up latency and throughput for latency-critical tasks.

kernel.sched_wakeup_granularity_ns = 15ms

echo 15000000 > /proc/sys/kernel/sched_wakeup_granularity_ns

3. The vm.dirty_background_ratio variable defaults to 10, which is the percentage of total system memory at which the pdflush background writeback daemon starts writing out dirty data. However, for a fast RAID-based disk system, this may cause large flushes of dirty memory pages. Increasing this value results in less frequent flushes.

vm.dirty_background_ratio = 40% RAM

sysctl vm.dirty_background_ratio=40
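To make these kernel settings persistent across reboots, a standard approach on RHEL and SLES is to add them to /etc/sysctl.conf and reload with sysctl -p (the values shown are the ones discussed above):

cat >> /etc/sysctl.conf <<EOF
kernel.sched_min_granularity_ns = 10000000
kernel.sched_wakeup_granularity_ns = 15000000
vm.dirty_background_ratio = 40
EOF
sysctl -p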

For additional details, see the deviceparams(4) man page in the StorNext Man Pages Reference Guide and also see StorNext Product Bulletin 50.