Deduplication Overview
When StorNext deduplication is enabled, a file is examined and logically split into data segments called Binary Large Objects (BLOBs). Each BLOB has a 128-bit BLOB tag. A file can be reconstructed from the list of BLOBs that make it up. The data for each BLOB is stored in the machine's blockpool. Use the command snpolicy -report file_pathname to see the list of BLOB tags for a deduplicated file.
When a deduplicated file is replicated, the BLOBs are replicated from the blockpool on the source machine to the blockpool on the target machine. If the source file system and the target file system are both hosted on the same machine, no data movement is needed. If the same BLOB tag occurs several times (in one file or in many files), only one copy of the data BLOB exists in the blockpool. During replication, that one copy needs to be copied to the target blockpool only once.
This is why deduplicated replication can be more efficient than non-deduplicated replication. With non-deduplicated replication, any change in a file requires that the entire file be recopied from the source to the target. And, if the data is mostly the same in several files (or exactly the same), non-deduplicated replication still copies each entire file from the source file system to the target.
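The efficiency argument above can be illustrated with a toy model in Python. This is only a sketch under stated assumptions, not StorNext's implementation: it cuts data into fixed 1 MB segments (real segments vary in length), uses a 128-bit BLAKE2b digest as a stand-in for the BLOB tag (how StorNext computes tags is not stated here), and models each blockpool as a dictionary. The names split_into_blobs, ingest, replicate, and rebuild are hypothetical.

```python
import hashlib

SEGMENT_SIZE = 1024 * 1024  # assumption: fixed 1 MB segments, for illustration only


def split_into_blobs(data, segment_size=SEGMENT_SIZE):
    """Return a list of (tag, segment) pairs for a byte string.

    BLAKE2b with a 16-byte digest is only a stand-in that yields a 128-bit tag.
    """
    blobs = []
    for offset in range(0, len(data), segment_size):
        segment = data[offset:offset + segment_size]
        tag = hashlib.blake2b(segment, digest_size=16).hexdigest().upper()
        blobs.append((tag, segment))
    return blobs


def ingest(blockpool, data):
    """Store each previously unseen BLOB once and return the file's tag list."""
    tags = []
    for tag, segment in split_into_blobs(data):
        blockpool.setdefault(tag, segment)  # an existing BLOB is not stored again
        tags.append(tag)
    return tags


def replicate(source_pool, target_pool, tags):
    """Copy only the BLOBs the target lacks and return the bytes moved."""
    moved = 0
    for tag in set(tags):  # a tag repeated within the file is sent once
        if tag not in target_pool:
            target_pool[tag] = source_pool[tag]
            moved += len(source_pool[tag])
    return moved


def rebuild(blockpool, tags):
    """Reconstruct file contents from its ordered list of BLOB tags."""
    return b"".join(blockpool[tag] for tag in tags)


# Two 2 MB files that share their first megabyte.
source, target = {}, {}
seg_a = b"\x01" * SEGMENT_SIZE
seg_b = b"\x02" * SEGMENT_SIZE
seg_c = b"\x03" * SEGMENT_SIZE
file1_tags = ingest(source, seg_a + seg_b)
file2_tags = ingest(source, seg_a + seg_c)

print(replicate(source, target, file1_tags))  # 2097152: both BLOBs are new to the target
print(replicate(source, target, file2_tags))  # 1048576: only the third BLOB is missing
```

In this model, a non-deduplicated copy of the same two files would move 4 MB, while the deduplicated copy moves 3 MB because the shared first megabyte is sent once.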
The following example uses these three files and their corresponding sizes:
f.2m - 2 MB
f.4m - 4 MB
g.4m - 4 MB
The maximum segment size in this example is 1 MB. (That size is artificially low for this example only.)
The snpolicy -report output for the directory containing these files looks like this:
./f.2m
policy: 1720449 inode: 1720468
flags: TAG
mtime: 2010-01-26 14:20:03.590665672 CST
ingest: 2010-01-26 14:20:03.590665672 CST
size: 2097152 disk blocks: 4096
seqno: 4 blk seqno: 2
offset: 0 length: 1048576 tag: D03281B0629858844F20BB791A60BD67
offset: 1048576 length: 1048576 tag: 12665A8E440FC4EF2B0C28B5D5B28159
./f.4m
policy: 1720449 inode: 1720470
flags: TAG
mtime: 2010-01-26 14:22:56.798334104 CST
ingest: 2010-01-26 14:22:56.798334104 CST
size: 4194304 disk blocks: 8192
seqno: 4 blk seqno: 4
offset: 0 length: 1048576 tag: D03281B0629858844F20BB791A60BD67
offset: 1048576 length: 1048576 tag: 12665A8E440FC4EF2B0C28B5D5B28159
offset: 2097152 length: 1048576 tag: 7F02E08B3D8C35541E80613142552316
offset: 3145728 length: 1048576 tag: 1FEC787120BEFA7E6685DF18110DF212
./g.4m
policy: 1720449 inode: 1720471
flags: TAG
mtime: 2010-01-26 14:23:28.957445176 CST
ingest: 2010-01-26 14:23:28.957445176 CST
size: 4194304 disk blocks: 8192
seqno: 5 blk seqno: 4
offset: 0 length: 1048576 tag: D03281B0629858844F20BB791A60BD67
offset: 1048576 length: 1048576 tag: DF54D6B832121A80FCB91EC0322CD5D3
offset: 2097152 length: 1048576 tag: 7F02E08B3D8C35541E80613142552316
offset: 3145728 length: 1048576 tag: 1FEC787120BEFA7E6685DF18110DF212
All three files have the same contents in the first megabyte starting at offset 0. The tag for that BLOB is D03281B0629858844F20BB791A60BD67, and that BLOB is stored only once in the blockpool. The second megabyte is the same for files f.2m and f.4m (tag 12665A8E440FC4EF2B0C28B5D5B28159), but file g.4m has a different BLOB in those bytes. The final 2 megabytes of files f.4m and g.4m are the same.
Remember that the above is an artificial example. In actual practice, BLOBs do not line up on 1 MB boundaries and are not all the same length.
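The saving in this example can be checked with a few lines of Python. The offsets, lengths, and tags below are copied from the report output above; the script itself is just an illustrative tally and ignores any metadata or other overhead a real blockpool may carry.

```python
MB = 1024 * 1024

# (offset, length, tag) rows copied from the snpolicy -report output above.
report = {
    "f.2m": [
        (0, 1048576, "D03281B0629858844F20BB791A60BD67"),
        (1048576, 1048576, "12665A8E440FC4EF2B0C28B5D5B28159"),
    ],
    "f.4m": [
        (0, 1048576, "D03281B0629858844F20BB791A60BD67"),
        (1048576, 1048576, "12665A8E440FC4EF2B0C28B5D5B28159"),
        (2097152, 1048576, "7F02E08B3D8C35541E80613142552316"),
        (3145728, 1048576, "1FEC787120BEFA7E6685DF18110DF212"),
    ],
    "g.4m": [
        (0, 1048576, "D03281B0629858844F20BB791A60BD67"),
        (1048576, 1048576, "DF54D6B832121A80FCB91EC0322CD5D3"),
        (2097152, 1048576, "7F02E08B3D8C35541E80613142552316"),
        (3145728, 1048576, "1FEC787120BEFA7E6685DF18110DF212"),
    ],
}

# Total size of the files as the file system sees them.
logical_bytes = sum(
    length for segments in report.values() for _, length, _ in segments
)

# Each distinct tag is counted once, no matter how many files reference it.
unique_blobs = {}
for segments in report.values():
    for _, length, tag in segments:
        unique_blobs[tag] = length
stored_bytes = sum(unique_blobs.values())

print(logical_bytes // MB, "MB of file data")                     # 10 MB of file data
print(len(unique_blobs), "unique BLOBs,", stored_bytes // MB, "MB of BLOB data")  # 5 unique BLOBs, 5 MB of BLOB data
```

The three files hold 10 MB between them, but only five distinct BLOBs (5 MB) need to be stored, and only those five would need to be sent to a target blockpool that holds none of them.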

When creating or editing a policy through the StorNext GUI, select the Deduplication tab and make sure deduplication is enabled (On). If you use the snpolicy -dumppol option, you will see dedup=on in the output when the policy has deduplication enabled.

Note that the snpolicy -dumppol output shown earlier also included dedup_age=1m. This means the file may be deduplicated after it has not changed for at least one minute. While a file is being written, its modification time (mtime) is updated. Deduplication age specifies how far in the past the modification time must be before a file can be considered for deduplication.

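The deduplication age rule amounts to a comparison between the current time and the file's mtime. The Python sketch below shows that comparison only; it is not how StorNext implements the check. The function names are hypothetical, and the suffix table in parse_age is an assumption (only the value 1m appears in this document).

```python
import os
import time


def parse_age(value):
    """Convert an age such as "1m" to seconds.

    Assumption: s/m/h/d suffixes. Only "1m" is confirmed by the policy output.
    """
    units = {"s": 1, "m": 60, "h": 3600, "d": 86400}
    return int(value[:-1]) * units[value[-1]]


def old_enough(mtime, now, age_seconds):
    """True when the last modification lies at least age_seconds in the past."""
    return now - mtime >= age_seconds


def is_dedup_candidate(path, dedup_age="1m"):
    """Apply the age test to a real file, using its current mtime."""
    return old_enough(os.stat(path).st_mtime, time.time(), parse_age(dedup_age))
```

A file that is still being written keeps getting a fresh mtime, so old_enough stays false until writes have stopped for the full deduplication age.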
If replication is used, a blockpool is required even if deduplication is not used in any policy on a machine. However, in this situation the blockpool does not store any BLOBs from any file system and can therefore be small: several megabytes is all that is needed.
If you enable deduplication on any policy in the machine, StorNext stores BLOBs in the blockpool and additional space is required. Make sure you have enough space to store file system data if you enable deduplication. You also need space for BLOBs in the blockpool if the machine contains replication target directories for deduplicated replication source directories on other machines.
The current StorNext release supports only one blockpool per machine. Any file system on the machine that needs a blockpool will use that one and only blockpool.