Optimistic Allocation
Note: It is no longer recommended that you change the InodeExpand parameters (InodeExpandMin, InodeExpandMax, and InodeExpandInc) from their default values. These settings are provided for compatibility when upgrading file systems.
The InodeExpand values are still honored if they are in the .cfgx file, but the StorNext GUI does not allow you to set these values. Also, when converting from .cfg to .cfgx files, if the InodeExpand values in the .cfg file are found to be the default example values, these values are not set in the new .cfgx. Instead, the Optimistic Allocation Formula is used.
The InodeExpand values come into play whenever a write to disk is done, and they work as an "optimistic allocator". The allocator is called "optimistic" because it assumes that where there is one allocation, there will be another, so it allocates more than you asked for, expecting that you will use the over-allocated space soon.
There are three ways to do a DMA I/O:
- By having an I/O larger than auto_dma_write_length (or auto_dma_read_length, but that does not cause an allocation, so it is ignored for this case)
- Doing a write to a file that was opened with O_DIRECT
- Opening a file for writes that's already open for writes by another client (commonly referred to as "shared write mode", which requires all I/Os go straight to disk to maintain coherency between the clients)
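The three conditions above can be written as a single predicate. This is only an illustrative model, not StorNext code: the function name and its arguments are hypothetical, auto_dma_write_length is assumed to be a byte count, and the 1 MiB threshold in the usage lines is an arbitrary value chosen for the example, not a documented default.

```python
MB = 1024 * 1024


def is_dma_write(io_size, auto_dma_write_length,
                 opened_o_direct=False, shared_write=False):
    """Return True if a write would go straight to disk (DMA I/O).

    Hypothetical model of the three cases listed above:
      1. the I/O is larger than auto_dma_write_length,
      2. the file was opened with O_DIRECT,
      3. the file is open for writes by more than one client.
    """
    return (
        io_size > auto_dma_write_length
        or opened_o_direct
        or shared_write
    )


# Arbitrary example threshold, not a documented default.
threshold = 1 * MB
print(is_dma_write(2 * MB, threshold))                      # large I/O
print(is_dma_write(4096, threshold))                        # small buffered I/O
print(is_dma_write(4096, threshold, opened_o_direct=True))  # O_DIRECT
print(is_dma_write(4096, threshold, shared_write=True))     # shared write mode
```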
The first allocation is the larger of the InodeExpandMin or the actual I/O size. For example, if the InodeExpandMin is 2 MB and the initial I/O is 1 MB, the file gets a 2 MB allocation. However, if the initial I/O was 3 MB and the InodeExpandMin is 2 MB, the file gets only a 3 MB allocation.
In both cases, the InodeExpandMin value is saved in an internal data structure in the file's inode, to be used with subsequent allocations. Each subsequent DMA I/O that requires more space to be allocated for the file adds InodeExpandInc to the value saved in the inode, and the allocation is the larger of this new value or the I/O size.
For example, if InodeExpandMin is 2 MB and InodeExpandInc is 4 MB and the first I/O is 1 MB, then the file is initially 2 MB in size. On the third 1 MB I/O the file is extended by 6 MB (2 MB + 4 MB) and is now 8 MB, though it only has 3 MB of data in it. However, that 6 MB allocation is likely contiguous, and therefore the file has at most 2 fragments, which is better than the 8 fragments it would have had otherwise.
Assuming there are more 1 MB I/Os to the file, it will continue to expand in this manner. The next DMA I/O requiring an allocation over the 8 MB mark will extend the file by 10 MB (2 MB + 4 MB + 4 MB). This pattern repeats until the file's allocation value is equal to or larger than InodeExpandMax, at which point it's capped at InodeExpandMax.
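The growth pattern described above can be reproduced with a short simulation. This is a minimal sketch of the rule as this section describes it, not the file system's actual implementation: the function name, the assumption that every write appends to the file, and the 32 MB value used for InodeExpandMax are all hypothetical and chosen only to illustrate the arithmetic.

```python
MB = 1024 * 1024


def simulate_inode_expand(io_sizes, expand_min, expand_inc, expand_max):
    """Return the size of each allocation triggered by appending writes.

    Models the rule described above (an assumption-laden sketch):
      - first allocation   = max(InodeExpandMin, I/O size)
      - later allocations  = max(saved value, I/O size), where the saved
        value grows by InodeExpandInc each time and is capped at
        InodeExpandMax.
    """
    allocated = 0    # bytes of disk space allocated to the file so far
    written = 0      # bytes of data written so far
    saved = None     # value kept in the inode between allocations
    allocations = []

    for io in io_sizes:
        written += io
        if written <= allocated:
            continue  # the write fits in space that is already allocated
        if saved is None:
            saved = expand_min
        else:
            saved = min(saved + expand_inc, expand_max)
        size = max(saved, io)
        allocated += size
        allocations.append(size)

    return allocations


# Nine 1 MB writes with InodeExpandMin = 2 MB and InodeExpandInc = 4 MB.
# The 32 MB InodeExpandMax is a hypothetical value for this example.
result = simulate_inode_expand([1 * MB] * 9, 2 * MB, 4 * MB, 32 * MB)
print([size // MB for size in result])  # [2, 6, 10]
```

The three allocations (2 MB, 6 MB, 10 MB) match the example in the text: the file reaches 8 MB on the third write and 18 MB on the ninth. Lowering the hypothetical InodeExpandMax to 8 MB would cap the third allocation at 8 MB instead of 10 MB.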
This formula generally works well when it is tuned for the specific I/O pattern. If it is not tuned, with certain I/O patterns it can cause suboptimal allocations resulting in excess fragmentation or wasted space from files being over allocated.
This is especially true if there are small files created with O_DIRECT, or small files that are simultaneously opened by multiple clients, which causes them to use an InodeExpandMin that is too large for them. Another possible problem is an InodeExpandMax that is too small, causing the file to be composed of smaller fragments than it could otherwise have been created with.
With very large files, leaving InodeExpandMax too small can create fragmented files, because each allocation is relatively small and a large number of allocations is needed to create a large file.
Another possible problem is an InodeExpandInc that is not aggressive enough, again causing a file to be created with more fragments than necessary, or causing the allocation value to never reach InodeExpandMax because writes stop before it can be incremented to that value.
Note: Although the preceding example uses DMA I/O, the InodeExpand parameters apply to both DMA and non-DMA allocations.
The following table displays the Optimistic Allocation formula:
| File Size (in bytes) | Optimistic Allocation |
| --- | --- |
| <= 16 MB | 1 MB |
| 16 MB to 64 MB + 4 bytes | 4 MB |
| 64 MB + 4 bytes to 256 MB + 16 bytes | 16 MB |
| 256 MB + 16 bytes to 1 GB + 64 bytes | 64 MB |
| 1 GB + 64 bytes to 4 GB + 256 bytes | 256 MB |
| 4 GB + 256 bytes to 16 GB + 1 KB | 1 GB |
| 16 GB + 1 KB to 64 GB + 4 KB | 4 GB |
| 64 GB + 4 KB to 256 GB + 16 KB | 16 GB |
| 256 GB + 16 KB to 1 TB + 64 KB | 64 GB |
| 1 TB + 64 KB or larger | 256 GB |
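The table can be expressed as a simple lookup. The sketch below is illustrative only and rests on assumptions the table does not state: units are taken to be binary (1 MB = 2^20 bytes, 1 KB = 1024 bytes), each upper bound is treated as inclusive, and the function and constant names are hypothetical.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB
TB = 1024 * GB

# (inclusive upper bound on file size, optimistic allocation), per the table.
# Treating the bounds as inclusive and the units as binary are assumptions.
OPTIMISTIC_TIERS = [
    (16 * MB, 1 * MB),
    (64 * MB + 4, 4 * MB),
    (256 * MB + 16, 16 * MB),
    (1 * GB + 64, 64 * MB),
    (4 * GB + 256, 256 * MB),
    (16 * GB + 1 * KB, 1 * GB),
    (64 * GB + 4 * KB, 4 * GB),
    (256 * GB + 16 * KB, 16 * GB),
    (1 * TB + 64 * KB, 64 * GB),
]
LARGEST_ALLOCATION = 256 * GB  # files beyond the last tier


def optimistic_allocation(file_size):
    """Return the allocation size (bytes) the table gives for file_size."""
    for upper_bound, allocation in OPTIMISTIC_TIERS:
        if file_size <= upper_bound:
            return allocation
    return LARGEST_ALLOCATION


for size in (10 * MB, 100 * MB, 5 * GB, 2 * TB):
    print(size, optimistic_allocation(size))
```

Each tier's allocation is four times the previous one, so the allocation stays between roughly 1/16 and 1/4 of the file size across the table.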
To examine how well these allocation strategies work in your specific environment, use the snfsdefrag utility with the -e option to display the individual extents (allocations) in a file.
Below is an example output from snfsdefrag -e testvideo2.mov:
testvideo2.mov:
# group frbase fsbase fsend kbytes depth
0 7 0x0 0xa86df6 0xa86df6 16 4
1 7 0x4000 0x1fb79b0 0x1fb79e1 800 4
HOLE @ frbase 0xcc000 for 41 blocks (656 kbytes)
2 7 0x170000 0x57ca034 0x57ca03f 192 4
3 7 0x1a0000 0x3788860 0x3788867 128 4
4 7 0x1c0000 0x68f6cb4 0x68f6cff 1216 4
5 7 0x2f0000 0x70839dd 0x70839df 48 4
Note: Beginning with StorNext 6, use the sgoffload command instead of the snfsdefrag command. The sgoffload command moves extents belonging to files that are currently in use (open). The sgoffload command also informs the client to suspend I/O for a time, moves the data, then informs the client to refresh the location of the data and resume I/O.
Here is an explanation of the column headings:
- #: This is the extent index.
- group: The group column tells you the stripe group on which the extent resides. Usually all extents are on the same stripe group, but not always.
- frbase: This is the file's logical offset.
- fsbase and fsend: These are the StorNext logical start and end addresses and should be ignored.
- kbytes: This is the size of the extent (fragment).
- depth: This tells you the number of LUNs that existed in the stripe group when the file was written. If you perform bandwidth expansion, this number is the old number of LUNs before bandwidth expansion, and signifies that those files aren't taking advantage of the bandwidth expansion.
If the file is sparse, you will see "HOLE" displayed. Having holes in a file is not necessarily a problem, but it does create extra fragments (one for each side of the hole). You can tune to eliminate holes, and as a result you reduce the fragmentation; however, you use more disk space.
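To compare fragmentation across files, it can help to summarize the extent listing rather than read it line by line. The following is a minimal sketch that parses text in the layout of the example output shown earlier; it assumes the seven-column format and the HOLE line wording seen in that sample, and the function name and returned structure are hypothetical.

```python
import re

# Sample text copied from the example output in this section.
SAMPLE = """\
testvideo2.mov:
# group frbase fsbase fsend kbytes depth
0 7 0x0 0xa86df6 0xa86df6 16 4
1 7 0x4000 0x1fb79b0 0x1fb79e1 800 4
HOLE @ frbase 0xcc000 for 41 blocks (656 kbytes)
2 7 0x170000 0x57ca034 0x57ca03f 192 4
3 7 0x1a0000 0x3788860 0x3788867 128 4
4 7 0x1c0000 0x68f6cb4 0x68f6cff 1216 4
5 7 0x2f0000 0x70839dd 0x70839df 48 4
"""


def summarize_extents(text):
    """Return (extents, hole_sizes_kb) parsed from an extent listing."""
    extents = []
    holes_kb = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "HOLE":
            match = re.search(r"\((\d+) kbytes\)", line)
            if match:
                holes_kb.append(int(match.group(1)))
        elif fields[0].isdigit() and len(fields) == 7:
            # Columns: index, group, frbase, fsbase, fsend, kbytes, depth
            extents.append({
                "index": int(fields[0]),
                "group": int(fields[1]),
                "frbase": int(fields[2], 16),
                "kbytes": int(fields[5]),
                "depth": int(fields[6]),
            })
    return extents, holes_kb


extents, holes_kb = summarize_extents(SAMPLE)
print("extents:", len(extents))                              # extents: 6
print("allocated KB:", sum(e["kbytes"] for e in extents))    # allocated KB: 2400
print("holes (KB):", holes_kb)                               # holes (KB): [656]
```

For the sample file this reports six extents totaling 2400 KB and one 656 KB hole, which is a quick way to see that the file is fragmented into small pieces relative to its size.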