I/O tuning using Mio


Section 1 – Testing the block device:

 

    The data section of a StorNext filesystem is made of stripe groups; each stripe group is made of one or more LUNs striped together. The recommendation is to make StripeBreadth a multiple of the LUN stripe size. To verify the I/O performance of the LUN itself we can use Mio on the block device directly.

WARNING: Writing to the block device will destroy any data on that device. Mio will try to skip over the label when writing to or reading from a block device.

Example:

    For our testing we are using a RAID5 5+1 volume with a 64K chunk size. This means that the StripeBreadth we should use is 64K * 5 = 320K.

Write test:

We can find the speed to expect from the device on a single path by using the following command.

 

# Mio -w -q 1 -n 100000 -b 320K /dev/mapper/mpath58

Mio: Timing 1 stream(s) of 100000 x 320K direct writes queued 1 deep

stream[0]: mpath58: write 32768.00 MBytes @ 391.19 MBytes/Second

Mio: Aggregate: 32768.00 Mbytes @ 391.19 MBytes/Second

 

This result tells us that we should not expect more than 391.19 MBytes/Second of single-threaded write performance on the RAID. Note that there are often multiple active paths to the same LUN, so we can increase the queue depth to see the maximum speed we can get from the device.
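Before raising the queue depth it is worth confirming how many active paths the multipath device actually has. The standard device-mapper multipath tooling will show this (device name from our example):

# multipath -ll mpath58

Each additional path contributes bandwidth only when enough I/O is queued to keep all paths busy, which is what a deeper queue provides.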

# Mio -w -q 4 -n 100000 -b 320K /dev/mapper/mpath58

Mio: Timing 1 stream(s) of 100000 x 320K direct writes queued 4 deep

stream[0]: mpath58: write 32768.00 MBytes @ 724.01 MBytes/Second

Mio: Aggregate: 32768.00 Mbytes @ 724.01 MBytes/Second

 

From the above results we can see that at queue depth 1 the throughput is limited by the speed of a single path rather than by the RAID unit itself: raising the queue depth to 4 nearly doubles it (391.19 to 724.01 MBytes/Second). In our case we had 2 active paths.

Read test:

Removing the -w option from the Mio commands in the above example will give you a read test.

# Mio -q 1 -n 100000 -b 320K /dev/mapper/mpath58

Mio: Timing 1 stream(s) of 100000 x 320K direct reads queued 1 deep

    stream[0]: mpath58: read 32768.00 MBytes @ 461.07 MBytes/Second

Mio: Aggregate: 32768.00 Mbytes @ 461.07 MBytes/Second

 

# Mio -q 4 -n 100000 -b 320K /dev/mapper/mpath58

Mio: Timing 1 stream(s) of 100000 x 320K direct reads queued 4 deep

    stream[0]: mpath58: read 32768.00 MBytes @ 685.41 MBytes/Second

Mio: Aggregate: 32768.00 Mbytes @ 685.41 MBytes/Second

 

 

Comparing with dd:

 

You can also use the ‘dd’ command to get similar data; however, ‘dd’ does not have a multithread (queue depth) option.

# dd if=/dev/zero of=/dev/mapper/mpath58 count=100000 bs=320K oflag=direct

100000+0 records in

100000+0 records out

32768000000 bytes (33 GB) copied, 82.9461 seconds, 395 MB/s

 

# dd if=/dev/mapper/mpath58 of=/dev/null count=100000 bs=320K

100000+0 records in

100000+0 records out

32768000000 bytes (33 GB) copied, 82.1339 seconds, 399 MB/s
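Because dd is single-threaded, one rough way to approximate Mio's deeper queue is to run several dd processes in parallel at non-overlapping offsets. A minimal sketch (the same data-destruction warning applies; the region size of 25000 blocks per writer is illustrative):

#!/bin/bash
# Four parallel writers; seek= is counted in bs-sized (320K) blocks,
# so each writer covers its own 25000-block region.
# WARNING: this DESTROYS DATA on the device.
DEV=/dev/mapper/mpath58
for i in 0 1 2 3
do
    dd if=/dev/zero of=${DEV} bs=320K count=25000 \
       seek=$(( i * 25000 )) oflag=direct &
done
wait

Adding up the transfer rates reported by the four dd processes gives a rough equivalent of Mio's aggregate at queue depth 4.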

 

 


Section 2 – Testing the filesystem:

    Now that we have our baseline numbers we can create our filesystem. In our example we will use 2 stripe groups with 2 LUNs each, and a StripeBreadth of 320K. The testing will be done with direct I/O and should help us decide on the best cachebufsize. We have to be careful not to make the cachebufsize too big, to avoid excessive read-modify-write overhead on small I/O. We also have to respect the limits on the cache buffer size: it must be a power-of-2 value that is no smaller than the filesystem block size and a multiple of the filesystem block size.
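As a quick sanity check for a candidate value, those constraints are easy to test in the shell (sizes in KiB; a throwaway sketch, not a StorNext tool):

FSB=64     # filesystem block size in KiB
CBS=512    # candidate cachebufsize in KiB
if (( (CBS & (CBS - 1)) == 0 && CBS >= FSB && CBS % FSB == 0 ))
then
    echo "cachebufsize ${CBS}K is valid for a ${FSB}K filesystem block size"
else
    echo "cachebufsize ${CBS}K is not valid"
fi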

Example:

    We can use Mio to test the different block sizes. We want our cache buffer size to be a multiple of the filesystem block size, which in our case is 64K. We can use multiple files to simulate buffered I/O behavior and generate more random access. Here is a script that will test write/read performance at different block sizes.

#!/bin/bash
# Test write and read performance at several block sizes using Mio.
# Usage: pass the base path for the test files as the first argument.
FILE=$1
COUNT=10000000

for size in 64 128 256 512 1024
do
    echo "====> Testing ${size}K block size"
    echo "==> Write"
    Mio -cw -q 1 -n $(( COUNT / size )) -b ${size}K ${FILE}-{A..E}

    echo "==> Read"
    Mio -q 1 -n $(( COUNT / size )) -b ${size}K ${FILE}-{A..E}
    rm -f ${FILE}-{A..E}
done


 

 

Running the script gave us the following performance values. 

Block Size    Read Performance        Write Performance
64K           647.13 MBytes/Second    606.60 MBytes/Second
128K          691.42 MBytes/Second    734.14 MBytes/Second
256K          714.22 MBytes/Second    794.07 MBytes/Second
512K          717.84 MBytes/Second    822.05 MBytes/Second
1024K         756.12 MBytes/Second    820.71 MBytes/Second


 

The results are not surprising. When doing 512K writes we get roughly the combined performance of the 2 LUNs that make up a stripe group. However, depending on the filesystem usage, a cachebufsize of 512K might be too big, so 256K might be a better choice even if the performance is not quite as good. For our testing we will assume a cachebufsize of 512K because we expect to do large I/O. Note that the write performance is higher than the read performance because most modern RAIDs have their write cache enabled.
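On the client side, cachebufsize is set as a mount option for the StorNext filesystem. An illustrative /etc/fstab entry follows; verify the exact option name and units against the mount_cvfs man page for your release:

snfs1    /stornext/snfs1    cvfs    rw,cachebufsize=512k    0 0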

 


Section 3 – RAID write cache:

    In this section we will look at the performance degradation that occurs when the write cache is disabled on the RAID unit.

Example:

        Writing to the block device, you can see that when the write cache is disabled there is a drastic write performance degradation.

# Mio -w -q 1 -n 1000 -b 512K /dev/mapper/mpath58

Mio: Timing 1 stream(s) of 1000 x 512K direct writes queued 1 deep

    stream[0]: mpath58: write 524.29 MBytes @ 10.94 MBytes/Second

Mio: Aggregate: 524.29 Mbytes @ 10.94 MBytes/Second

 

        The result is similar on the filesystem.

# Mio -cw -q 1 -n 1000 -b 512K /stornext/snfs1/perf1

Mio: Timing 1 stream(s) of 1000 x 512K direct writes queued 1 deep

stream[0]: perf1: write 524.29 MBytes @ 33.63 MBytes/Second

Mio: Aggregate: 524.29 Mbytes @ 33.63 MBytes/Second
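To confirm the cache setting from the host side, you can often query the WCE (write cache enable) bit of the LUN with sdparm. This is vendor-dependent and the RAID's own management interface is authoritative; you may also need to run it against one of the underlying /dev/sd paths rather than the multipath node:

# sdparm --get=WCE /dev/mapper/mpath58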

 


 


Section 4 – Stripe group performance:

    When a file is created, its data gets written to a stripe group. All of the file's data will stay on the same stripe group unless that stripe group runs out of space. This can cause seemingly random performance problems when the stripe groups do not all provide the same performance.

Example:

    In the following example we have 2 data stripe groups: one performs very well and the other performs very badly. When we run Mio with multiple files, some files are allocated on one stripe group and some on the other.

# Mio -cw -q 1 -n 1000 -b 512K file1 file2 file3 file4

Mio: Timing 4 stream(s) of 1000 x 512K direct writes queued 1 deep

    stream[0]: file1: write 524.29 MBytes @ 328.15 MBytes/Second

    stream[1]: file2: write 524.29 MBytes @ 40.84 MBytes/Second

    stream[2]: file3: write 524.29 MBytes @ 328.14 MBytes/Second

    stream[3]: file4: write 524.29 MBytes @ 40.89 MBytes/Second

Mio: Aggregate: 2097.15 Mbytes @ 163.35 MBytes/Second

 

        Here we can see that file1 and file3 had very good performance while file2 and file4 had very bad performance. We can use snfsdefrag to count each file's extents on a given stripe group (-G) and see where each file was allocated.

# snfsdefrag -c -G 1 file*

file1: no extents for specified stripe group

file2: 44 extent(s)

file3: no extents for specified stripe group

file4: 43 extent(s)

 

# snfsdefrag -c -G 2 file*

file1: 43 extent(s)

file2: no extents for specified stripe group

file3: 43 extent(s)

file4: no extents for specified stripe group

 

 

        This shows us that file2 and file4 are on stripe group 1 while file1 and file3 are on stripe group 2; we can therefore conclude that there is a performance problem with stripe group 1.
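To dig deeper you can list each file's full extent map, which shows the stripe group and the on-disk location of every extent (snfsdefrag's -e extent-listing mode; output format varies by StorNext release):

for f in file1 file2 file3 file4
do
    echo "==> ${f}"
    snfsdefrag -e ${f}
done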


 

 


