How ASR Works
For details on how to set the “size” and enable this feature, refer to the snfs_config(5) man page and the StorNext GUI’s online help. The snfs_config(5) man page also contains an overview of how the ASR feature works. Because this “How ASR Works” section provides more detail, you should already be familiar with the contents of the man page before reading this section.
Allocation requests (which occur whenever a file is written to an area that has no actual disk space allocated) are grouped into sessions. A chunk of space is reserved for each session. The size of the chunk is determined from the configured ASR size and the size of the allocation request: if the allocation request is larger than 1 MB and smaller than 1/8th of the configured ASR chunk size, the chunk size is rounded up to a multiple of the initial allocation request size.
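As a rough illustration of this chunk-size rule, the short Python sketch below computes the size a session chunk might be reserved at. The function name and the exact rounding arithmetic are assumptions for illustration only; they are not the actual StorNext allocator code.

# Hypothetical sketch of the chunk-size rounding described above;
# names and details are illustrative, not actual StorNext code.
ONE_MB = 1024 * 1024

def session_chunk_size(configured_size, alloc_request):
    """Return the chunk size to reserve for a new session."""
    chunk = configured_size
    # If the request is larger than 1 MB but smaller than 1/8th of the
    # configured size, round the chunk up to a multiple of the request.
    if ONE_MB < alloc_request < configured_size // 8:
        remainder = chunk % alloc_request
        if remainder:
            chunk += alloc_request - remainder
    return chunk

# Example: a 1 GB configured size with a 48 MB initial request rounds
# the chunk up to 1056 MB (22 x 48 MB).
print(session_chunk_size(1024 * ONE_MB, 48 * ONE_MB) // ONE_MB)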
There are three session types: small, medium (directory), and large (file). The session type is determined by the file offset and requested allocation size on a given allocation request.
- Small sessions are for sizes (offset + allocation size) smaller than 1MB.
- Medium sessions are for sizes 1MB through 1/10th of the configured ASR size.
- Large sessions are sizes bigger than medium.
Here is another way to think of these three types: small sessions collect all small files into small session chunks; medium sessions group medium-sized files into chunks according to their parent directory; and large file allocations get their own chunks and are allocated independently of other files.
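A minimal sketch of how the session type could be chosen from the offset and requested size, using the thresholds listed above (the function and its return values are hypothetical, not StorNext code):

# Illustrative only; the thresholds come from the list above.
ONE_MB = 1024 * 1024

def classify_session(offset, alloc_size, configured_asr_size):
    """Pick a session type from the end position of an allocation request."""
    end = offset + alloc_size
    if end < ONE_MB:
        return "small"      # grouped with other small files, per client
    if end <= configured_asr_size // 10:
        return "medium"     # grouped by parent directory
    return "large"          # the file gets its own chunks

# With a 1 GB configured ASR size:
print(classify_session(0, 512 * 1024, 1024 * ONE_MB))   # small
print(classify_session(0, 20 * ONE_MB, 1024 * ONE_MB))  # medium
print(classify_session(0, 200 * ONE_MB, 1024 * ONE_MB)) # large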
All sessions are client specific: multiple writers to the same directory or to the same large file on different clients use different sessions, and small files from different clients are placed in separate per-client chunks.
Small sessions use a smaller chunk size than the configured size. The small chunk size is determined by dividing the configured size by 32.
For example, for a configured size of 128 MB the small chunk size is 4 MB, and for 1 GB the small chunk size is 32 MB. Small sessions do not round the chunk size. A file can get an allocation from a small session only if the allocation request (offset + size) is less than 1 MB. When users perform small writes into a file, the client buffer cache coalesces them and minimizes allocation requests. If a file larger than 1 MB is being written through the buffer cache, it will most likely generate allocation requests on the order of 16 MB (depending on the size of the buffer cache on the client and the number of concurrent users of that buffer cache).
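A quick sketch of the small-chunk rule (the helper name is hypothetical; the divide-by-32 rule and the example values come from the text above):

# Small-session chunk size: configured size divided by 32, no rounding.
def small_chunk_size(configured_size):
    return configured_size // 32

MB = 1024 * 1024
print(small_chunk_size(128 * MB) // MB)   # 4  (MB), for a 128 MB configured size
print(small_chunk_size(1024 * MB) // MB)  # 32 (MB), for a 1 GB configured size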
With NFS I/O into a StorNext client, the StorNext buffer cache is used. NFS on some operating systems breaks I/O into multiple streams per file, which arrive on the StorNext client as disjoint random writes. With ASR, these are typically allocated from the same session and are not impacted if multiple streams (other files) allocate from the same stripe group. ASR can therefore help reduce the fragmentation caused by these separate NFS-generated streams.
Files can start using one session type and then move to another. A file can start with a very small allocation (small session), grow larger (medium session), and end up reserving the session for itself. If a file occupies more than 10% of a medium-sized chunk, it “reserves” the remainder of the session chunk it was using for itself. After a session is reserved for a file, a new session segment is allocated for any other medium files in that directory.
Small chunks are never reserved.
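A minimal sketch of the reservation rule, assuming a hypothetical helper; the 10% threshold and the "small chunks are never reserved" rule come from the text above:

# Illustrative only; not actual StorNext code.
def should_reserve_session(file_size, alloc_size, medium_chunk_size, session_type):
    """Decide whether a medium (directory) session is reserved for one file."""
    if session_type != "medium":
        return False        # small chunks are never reserved
    # Once a file occupies more than 10% of a medium-sized chunk, it keeps
    # the remainder of that chunk for itself; other medium files in the
    # directory then get a new session segment.
    return (file_size + alloc_size) > medium_chunk_size // 10

GB = 1024 ** 3
print(should_reserve_session(50 * 1024**2, 2 * 1024**2, GB, "medium"))   # False
print(should_reserve_session(200 * 1024**2, 2 * 1024**2, GB, "medium"))  # True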
When allocating subsequent pieces for a session, they are rotated around to other stripe groups that can hold user data.
Note: Rotation is not done if InodeStripeWidth (ISW) is set to 0.
When InodeStripeWidth is set, chunks are rotated in a similar fashion to InodeStripeWidth. The direction of rotation is determined by a combination of the session key and the index of the client in the client table. The session key is based on the inode number, so odd inodes rotate in a different direction from even inodes. Directory session keys are based on the inode number of the parent directory. For additional information about InodeStripeWidth, refer to the snfs_config(5) man page.
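The rotation-direction logic might be sketched as follows. The session-key derivation, the use of the client index, and the modular stepping here are assumptions made only to illustrate the description above, not the actual implementation:

# Hypothetical sketch of chunk rotation across data stripe groups.
def next_stripe_group(current_sg, num_data_sgs, inode_num, client_index):
    # The session key is based on the inode number (for directory sessions,
    # the parent directory's inode number).
    session_key = inode_num
    # Odd and even keys rotate in opposite directions; the client's index in
    # the client table also influences the direction, so different clients
    # tend to spread their sessions across different stripe groups.
    direction = 1 if (session_key + client_index) % 2 == 0 else -1
    return (current_sg + direction) % num_data_sgs

# Example: with 4 data stripe groups, an even key on client 0 walks
# 0 -> 1 -> 2 -> 3 -> 0, while an odd key walks 0 -> 3 -> 2 -> 1 -> 0.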
Video applications typically write one frame per file, place the files in their own unique directory, and write them from the same StorNext client. The file sizes are all greater than 1 MB and smaller than 50 MB, and each is written and allocated in one I/O operation. Each file and write lands in a “medium/directory” session.
For this kind of workflow, ASR is the ideal method to keep “streams” (a related collection of frames in one directory) together on disk, thereby preventing checkerboarding between multiple concurrent streams. In addition, when a stream is removed, the space can be returned to the free space pool in big ASR pieces, reducing free space fragmentation compared to the default allocator.
Suppose a file system has four data stripe groups and an ASR size of 1 GB. If four concurrent applications writing medium-sized files in four separate directories are started, they will each start with their own 1 GB piece and most likely be on different stripe groups.
Without ASR
Without ASR, the files from the four applications are intermingled on disk with one another. The default allocator does not consider the directory or application in any way when carving out space; all allocation requests are treated equally. With ASR turned off and all the applications running together, any hotspot is very short lived: the size of one allocation/file. (See the following section for more information about hotspots.)
With ASR
Now consider the four 1 GB chunks for the four separate directories. As the chunks are used up, ASR allocates new chunks on other stripe groups using rotation. Given this rotation and the timing of each application, there are times when multiple writers/segments end up on the same stripe group together. This is considered a “hotspot,” and if the applications expect more throughput than the stripe group can provide, performance will be subpar.
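The toy sketch below makes the hotspot idea concrete by rotating four directory sessions across four stripe groups; the lockstep rotation and the timing assumptions are simplifications, not the real allocator behavior:

# Toy illustration: four directory sessions rotating across four stripe groups.
def chunk_sequence(start_sg, num_sgs, chunks):
    return [(start_sg + i) % num_sgs for i in range(chunks)]

streams = {app: chunk_sequence(app, 4, 8) for app in range(4)}
print(streams)
# In perfect lockstep the streams never share a stripe group, but if
# application 0 has already advanced to its fourth chunk (stripe group 3)
# while application 3 is still filling its first chunk (also stripe group 3),
# both write to the same stripe group for a while (a hotspot).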
At read time, the checkerboarding on disk from the writes (when ASR is off) can cause disk head movement, and the later removal of one application run can also cause free space fragmentation. Because ASR collects the files from one application together, the read performance of that application's data can be significantly better, since there will be little to no disk head movement.
Small files (those less than 1 MB) are placed together in small file chunks and grouped by StorNext client ID. This helps use up the leftover pieces from the ASR-sized chunks and keeps the small files away from medium files, reducing the free space fragmentation that those leftover pieces would otherwise cause over time. Leftover pieces occur only in some rare cases, such as when there are many concurrent sessions (more than 500).
When an application starts writing a very large file, it typically writes in some unit size and extends the file as it goes. For this scenario, assume the following:
- ASR is turned on, and the configured size is 1 GB.
- The application is writing in 2 MB chunks and writing a 10 GB file.
- ISW is set to 1 GB.
On the first I/O (allocation), an ASR session is created for the directory (if one does not already exist), and space is either stolen from an expired session or allocated as a new 1 GB piece on some stripe group.
When the file size plus the requested allocation size passes 100 MB, the session is converted from a directory session to a file-specific session and reserved for this file. When the file size surpasses the ASR size, chunks are reserved using the configured ISW size.
Returning to our example, the extents for the 10 GB file should start with a 1 GB extent (assuming the first chunk was not a stolen, partial piece), and the remaining extents, except possibly the last one, should all be 1 GB.
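Under those assumptions, the expected layout can be checked with a short calculation (a sketch only; it ignores the possibility of a stolen or partial first chunk):

# Expected extents for the 10 GB file: 1 GB chunks, rotated across the
# data stripe groups (assumes a clean 1 GB first chunk and 1 GB ISW chunks).
GB = 1024 ** 3
file_size = 10 * GB
chunk = 1 * GB
extents = [min(chunk, file_size - off) for off in range(0, file_size, chunk)]
print(len(extents), set(e // GB for e in extents))
# -> 10 extents, each 1 GB, which matches the snfsdefrag output below.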
The following is an example of the extent layout from one process actively writing in its own directory as described above:
root@per2:() -> snfsdefrag -e 10g.lmdd
10g.lmdd:
# group frbase fsbase fsend kbytes depth
0 3 0x0 0xdd4028 0xde4027 1048576 1
1 4 0x40000000 0xdd488a 0xde4889 1048576 1
2 1 0x80000000 0x10f4422 0x1104421 1048576 1
3 2 0xc0000000 0x20000 0x2ffff 1048576 1
4 3 0x100000000 0xd34028 0xd44027 1048576 1
5 4 0x140000000 0xd9488a 0xda4889 1048576 1
6 1 0x180000000 0x10c4422 0x10d4421 1048576 1
7 2 0x1c0000000 0x30000 0x3ffff 1048576 1
8 3 0x200000000 0x102c028 0x103c027 1048576 1
9 4 0x240000000 0xd6c88a 0xd7c889 1048576 1
Below are the extent layouts of two processes writing concurrently, each in its own directory:
root@per2:() -> lmdd of=1d/10g bs=2m move=10g & lmdd of=2d/10g bs=2m move=10g &
[1] 27866
[2] 27867
root@per2:() -> wait
snfsdefrag -e 1d/* 2d/*
10240.00 MB in 31.30 secs, 327.14 MB/sec
[1]- Done lmdd of=1d/10g bs=2m move=10g
10240.00 MB in 31.34 secs, 326.74 MB/sec
[2]+ Done lmdd of=2d/10g bs=2m move=10g
root@per2:() ->
root@per2:() -> snfsdefrag -e 1d/* 2d/*
1d/10g:
# group frbase fsbase fsend kbytes depth
0 1 0x0 0xf3c422 0xf4c421 1048576 1
1 4 0x40000000 0xd2c88a 0xd3c889 1048576 1
2 3 0x80000000 0xfcc028 0xfdc027 1048576 1
3 2 0xc0000000 0x50000 0x5ffff 1048576 1
4 1 0x100000000 0x7a0472 0x7b0471 1048576 1
5 4 0x140000000 0xc6488a 0xc74889 1048576 1
6 3 0x180000000 0xcd4028 0xce4027 1048576 1
7 2 0x1c0000000 0x70000 0x7ffff 1048576 1
8 1 0x200000000 0x75ef02 0x76ef01 1048576 1
9 4 0x240000000 0xb9488a 0xba4889 1048576 1
2d/10g:
# group frbase fsbase fsend kbytes depth
0 2 0x0 0x40000 0x4ffff 1048576 1
1 3 0x40000000 0xffc028 0x100c027 1048576 1
2 4 0x80000000 0xca488a 0xcb4889 1048576 1
3 1 0xc0000000 0xedc422 0xeec421 1048576 1
4 2 0x100000000 0x60000 0x6ffff 1048576 1
5 3 0x140000000 0xea4028 0xeb4027 1048576 1
6 4 0x180000000 0xc2c88a 0xc3c889 1048576 1
7 1 0x1c0000000 0x77f9ba 0x78f9b9 1048576 1
8 2 0x200000000 0x80000 0x8ffff 1048576 1
9 3 0x240000000 0xbe4028 0xbf4027 1048576 1
Finally, consider two concurrent writers in the same directory on the same client writing 10 GB files. The files will checkerboard until they reach 100 MB. After that, each file will have its own large session and the checkerboarding will cease.
Below is an example of two 5 GB files written in the same directory at the same time with 2 MB I/Os. The output is from the snfsdefrag -e <file> command.
First Example
# group frbase fsbase fsend kbytes depth
0 1 0x0 0x18d140 0x18d23f 4096 1
1 1 0x400000 0x18d2c0 0x18d33f 2048 1
2 1 0x600000 0x18d3c0 0x18d43f 2048 1
3 1 0x800000 0x18d4c0 0x18d53f 2048 1
4 1 0xa00000 0x18d5c0 0x18d73f 6144 1
5 1 0x1000000 0x18d7c0 0x18d83f 2048 1
6 1 0x1200000 0x18d8c0 0x18d9bf 4096 1
7 1 0x1600000 0x18dbc0 0x18dcbf 4096 1
8 1 0x1a00000 0x18dfc0 0x18e4bf 20480 1
9 1 0x2e00000 0x18e8c0 0x18e9bf 4096 1
10 1 0x3200000 0x18eac0 0x18ebbf 4096 1
11 1 0x3600000 0x18ecc0 0x18f3bf 28672 1
12 1 0x5200000 0x18f9c0 0x18fdbf 16384 1
13 1 0x6200000 0x1901c0 0x19849f 536064 1
14 3 0x26d80000 0x1414028 0x1424027 1048576 1
15 4 0x66d80000 0x150f092 0x151f091 1048576 1
16 1 0xa6d80000 0x10dc6e 0x11dc6d 1048576 1
17 3 0xe6d80000 0x1334028 0x1344027 1048576 1
18 4 0x126d80000 0x8f74fe 0x8fd99d 412160 1
Second Example
# group frbase fsbase fsend kbytes depth
0 1 0x0 0x18d0c0 0x18d13f 2048 1
1 1 0x200000 0x18d240 0x18d2bf 2048 1
2 1 0x400000 0x18d340 0x18d3bf 2048 1
3 1 0x600000 0x18d440 0x18d4bf 2048 1
4 1 0x800000 0x18d540 0x18d5bf 2048 1
5 1 0xa00000 0x18d740 0x18d7bf 2048 1
6 1 0xc00000 0x18d840 0x18d8bf 2048 1
7 1 0xe00000 0x18d9c0 0x18dbbf 8192 1
8 1 0x1600000 0x18dcc0 0x18dfbf 12288 1
9 1 0x2200000 0x18e4c0 0x18e8bf 16384 1
10 1 0x3200000 0x18e9c0 0x18eabf 4096 1
11 1 0x3600000 0x18ebc0 0x18ecbf 4096 1
12 1 0x3a00000 0x18f3c0 0x18f9bf 24576 1
13 1 0x5200000 0x18fdc0 0x1901bf 16384 1
14 4 0x6200000 0x1530772 0x1540771 1048576 1
15 3 0x46200000 0x1354028 0x1364027 1048576 1
16 1 0x86200000 0x12e726 0x13e725 1048576 1
17 4 0xc6200000 0x14ed9b2 0x14fd9b1 1048576 1
18 3 0x106200000 0x1304028 0x13127a7 948224 1
Note: Beginning with StorNext 6, use the sgoffload command instead of the snfsdefrag command. The sgoffload command moves extents belonging to files that are currently in use (open). The sgoffload command also informs the client to suspend I/O for a time, moves the data, then informs the client to refresh the location of the data and resume I/O.
Without ASR and with concurrent writers of big files, each file typically starts on its own stripe group. The checkerboarding does not occur until there are more writers than data stripe groups. However, once the checkerboarding starts, it persists all the way through the file. For example, with two data stripe groups and four writers, all four files would checkerboard until the number of writers is reduced back to two or fewer.
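As a rough sketch of this last point (a simplified round-robin assumption, not the actual behavior of the default allocator):

# Toy sketch: writers land on data stripe groups roughly round-robin, so
# two writers share a stripe group only when writers outnumber stripe groups.
def assign_writers(num_writers, num_data_sgs):
    return {w: w % num_data_sgs for w in range(num_writers)}

print(assign_writers(2, 2))  # {0: 0, 1: 1}: no sharing, no checkerboarding
print(assign_writers(4, 2))  # {0: 0, 1: 1, 2: 0, 3: 1}: two writers per group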