Deduplication Overview

When StorNext deduplication is enabled, a file is examined and logically split into data segments called Binary Large Objects (BLOBs). Each BLOB has a 128-bit BLOB tag. A file can be reconstructed from the list of BLOBs that make up the file. The data for each BLOB is stored in the blockpool on the machine. Use the command snpolicy -report file_pathname to see the list of BLOB tags for a deduplicated file.
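The relationship between a file, its BLOB list, and the blockpool can be sketched in Python. This is only an illustrative model, not the StorNext implementation: the fixed segment size, the use of BLAKE2b truncated to 128 bits as a stand-in tag function, the in-memory dictionary acting as a blockpool, and the function names split_into_blobs, store, and reconstruct are all assumptions made for the sketch.

```python
import hashlib

SEGMENT_SIZE = 1024 * 1024  # assumed fixed segment size, for illustration only


def split_into_blobs(data: bytes, segment_size: int = SEGMENT_SIZE):
    """Return a list of (offset, length, tag) tuples describing the data."""
    blobs = []
    for offset in range(0, len(data), segment_size):
        segment = data[offset:offset + segment_size]
        # Stand-in tag: BLAKE2b with a 16-byte (128-bit) digest.
        # The real StorNext tag function is not specified here.
        tag = hashlib.blake2b(segment, digest_size=16).hexdigest().upper()
        blobs.append((offset, len(segment), tag))
    return blobs


def store(data: bytes, blockpool: dict, segment_size: int = SEGMENT_SIZE):
    """Add each distinct segment to the blockpool once; return the BLOB list."""
    blob_list = split_into_blobs(data, segment_size)
    for offset, length, tag in blob_list:
        # setdefault keeps the existing copy if the tag is already present.
        blockpool.setdefault(tag, data[offset:offset + length])
    return blob_list


def reconstruct(blob_list, blockpool: dict) -> bytes:
    """Rebuild the file contents from its BLOB list and the blockpool."""
    return b"".join(blockpool[tag] for _, _, tag in blob_list)
```

In this model, the BLOB list returned by store plays the role of the per-file tag list that snpolicy -report displays, and reconstruct shows why that list plus the blockpool is enough to recover the file.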

When a deduplicated file is replicated, the BLOBs are replicated from the blockpool on the source machine to the blockpool on the target machine. If the source file system and the target file system are both hosted on the same machine, no data movement is needed. If the same BLOB tag occurs several times (in one file or in many files), only one copy of the BLOB data exists in the blockpool. During replication, that one copy needs to be copied to the target blockpool only once.
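The copy-once behavior described above can be modeled as follows. This is a hedged sketch rather than the actual replication protocol: the replicate function and the dictionaries standing in for the source and target blockpools are hypothetical.

```python
def replicate(file_tags, source_pool: dict, target_pool: dict) -> int:
    """Copy BLOBs the target lacks; return the number of bytes transferred."""
    transferred = 0
    for tag in file_tags:
        if tag not in target_pool:
            # First time the target sees this tag: copy the data once.
            target_pool[tag] = source_pool[tag]
            transferred += len(source_pool[tag])
    return transferred
```

Passing the same dictionary as both source_pool and target_pool models the case where both file systems are hosted on the same machine: every tag is already present, so the function reports zero bytes transferred.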

This is why deduplicated replication can be more efficient than non-deduplicated replication. With non-deduplicated replication, any change in a file requires that the entire file be recopied from the source to the target. Likewise, if several files contain mostly (or exactly) the same data, non-deduplicated replication still copies each entire file from the source file system to the target.

The following example uses these three files and their corresponding sizes:

f.2m - 2 MB

f.4m - 4 MB

g.4m - 4 MB

The maximum segment size in this example is 1 MB. (That size is artificially low for this example only.)

If you look at the snpolicy -report output for the directory containing these files, you see the following:

./f.2m

policy: 1720449 inode: 1720468

flags: TAG

mtime: 2010-01-26 14:20:03.590665672 CST

ingest: 2010-01-26 14:20:03.590665672 CST

size: 2097152 disk blocks: 4096

seqno: 4 blk seqno: 2

offset: 0 length: 1048576 tag: D03281B0629858844F20BB791A60BD67

offset: 1048576 length: 1048576 tag: 12665A8E440FC4EF2B0C28B5D5B28159

./f.4m

policy: 1720449 inode: 1720470

flags: TAG

mtime: 2010-01-26 14:22:56.798334104 CST

ingest: 2010-01-26 14:22:56.798334104 CST

size: 4194304 disk blocks: 8192

seqno: 4 blk seqno: 4

offset: 0 length: 1048576 tag: D03281B0629858844F20BB791A60BD67

offset: 1048576 length: 1048576 tag: 12665A8E440FC4EF2B0C28B5D5B28159

offset: 2097152 length: 1048576 tag: 7F02E08B3D8C35541E80613142552316

offset: 3145728 length: 1048576 tag: 1FEC787120BEFA7E6685DF18110DF212

./g.4m

policy: 1720449 inode: 1720471

flags: TAG

mtime: 2010-01-26 14:23:28.957445176 CST

ingest: 2010-01-26 14:23:28.957445176 CST

size: 4194304 disk blocks: 8192

seqno: 5 blk seqno: 4

offset: 0 length: 1048576 tag: D03281B0629858844F20BB791A60BD67

offset: 1048576 length: 1048576 tag: DF54D6B832121A80FCB91EC0322CD5D3

offset: 2097152 length: 1048576 tag: 7F02E08B3D8C35541E80613142552316

offset: 3145728 length: 1048576 tag: 1FEC787120BEFA7E6685DF18110DF212

All three files have the same contents in the first megabyte starting at offset 0. The tag for that BLOB is D03281B0629858844F20BB791A60BD67, and that BLOB is stored only once in the blockpool. The second megabyte is the same for files f.2m and f.4m (tag 12665A8E440FC4EF2B0C28B5D5B28159), but file g.4m has a different BLOB in those bytes (tag DF54D6B832121A80FCB91EC0322CD5D3). The final 2 megabytes of files f.4m and g.4m are the same (tags 7F02E08B3D8C35541E80613142552316 and 1FEC787120BEFA7E6685DF18110DF212).
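The savings in this example can be checked with a short calculation. The tags below are copied from the report output above; the variable names and the simple model of replication (a set of tags already present on an initially empty target) are assumptions made for illustration.

```python
MB = 1048576  # every BLOB in this example is exactly 1 MB long

# BLOB tags per file, in offset order, taken from the report output.
files = {
    "f.2m": ["D03281B0629858844F20BB791A60BD67",
             "12665A8E440FC4EF2B0C28B5D5B28159"],
    "f.4m": ["D03281B0629858844F20BB791A60BD67",
             "12665A8E440FC4EF2B0C28B5D5B28159",
             "7F02E08B3D8C35541E80613142552316",
             "1FEC787120BEFA7E6685DF18110DF212"],
    "g.4m": ["D03281B0629858844F20BB791A60BD67",
             "DF54D6B832121A80FCB91EC0322CD5D3",
             "7F02E08B3D8C35541E80613142552316",
             "1FEC787120BEFA7E6685DF18110DF212"],
}

# Total file data, which is also what non-deduplicated replication copies.
logical_bytes = sum(len(tags) for tags in files.values()) * MB

# Distinct BLOBs actually held in the blockpool.
unique_tags = {tag for tags in files.values() for tag in tags}
stored_bytes = len(unique_tags) * MB

# Deduplicated replication to an empty target: send each tag only once.
target_tags = set()
replicated_bytes = 0
for tags in files.values():
    for tag in tags:
        if tag not in target_tags:
            target_tags.add(tag)
            replicated_bytes += MB

print(logical_bytes // MB)     # 10
print(len(unique_tags))        # 5
print(stored_bytes // MB)      # 5
print(replicated_bytes // MB)  # 5
```

Under these assumptions, the three files hold 10 MB of data but only five distinct BLOBs, so the blockpool stores 5 MB and deduplicated replication moves 5 MB instead of 10 MB.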

Remember that the above is an artificial example. In actual practice, BLOBs do not line up on 1 MB boundaries and are not all the same length.