Deduplication Overview
Here is the view from 100,000 feet. When StorNext deduplication is enabled, a file is examined and logically split into data segments called BLOBs (binary large objects). Each BLOB has a 128-bit BLOB tag. A file can be reconstructed from the list of BLOBs that make up that file. The data for each BLOB is stored in the blockpool for a machine. We can use the command snpolicy -report file_pathname to see the list of BLOB tags for a deduplicated file.
When a deduplicated file is replicated, the BLOBs are replicated from the blockpool on the source machine to the blockpool on the target machine. If the source file system and the target file system are both hosted on the same machine, no data movement is needed. If the same BLOB tag occurs several times (in one file or in many files), only one copy of the data BLOB exists in the blockpool. During replication, that one copy needs to be copied to the target blockpool only once.
This is why deduplicated replication can be more efficient than non-deduplicated replication. With non-deduplicated replication, any change in a file requires that the entire file be recopied from the source to the target. And, if the data is mostly the same in several files (or exactly the same), non-deduplicated replication still copies each entire file from the source file system to the target.
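The behavior described above can be sketched as a toy model. This is a minimal illustration, not StorNext's implementation: the blockpools are plain dictionaries, the segments are tiny and fixed in size, and the 128-bit tag is computed with BLAKE2b purely as a stand-in (the actual tag algorithm is not described here). All function names are hypothetical.

```python
import hashlib

SEGMENT_SIZE = 4  # bytes; tiny fixed-size segments, for illustration only


def blob_tag(data: bytes) -> str:
    """Return a 128-bit tag as 32 hex digits (BLAKE2b is an assumed stand-in)."""
    return hashlib.blake2b(data, digest_size=16).hexdigest().upper()


def ingest(blockpool: dict, content: bytes) -> list:
    """Split content into segments, store each unique BLOB once, return the tag list."""
    tags = []
    for offset in range(0, len(content), SEGMENT_SIZE):
        blob = content[offset:offset + SEGMENT_SIZE]
        tag = blob_tag(blob)
        blockpool.setdefault(tag, blob)  # a repeated tag does not store a second copy
        tags.append(tag)
    return tags


def replicate(tags: list, source_pool: dict, target_pool: dict) -> int:
    """Copy only the BLOBs the target lacks; return how many were transferred."""
    copied = 0
    for tag in tags:
        if tag not in target_pool:
            target_pool[tag] = source_pool[tag]
            copied += 1
    return copied


def reconstruct(tags: list, blockpool: dict) -> bytes:
    """Rebuild a file from its ordered list of BLOB tags."""
    return b"".join(blockpool[tag] for tag in tags)


source, target = {}, {}
file_a = ingest(source, b"AAAABBBBAAAA")  # segments: AAAA, BBBB, AAAA
file_b = ingest(source, b"AAAACCCC")      # segments: AAAA, CCCC

copied_a = replicate(file_a, source, target)  # transfers AAAA and BBBB
copied_b = replicate(file_b, source, target)  # transfers only CCCC
print(len(source), copied_a, copied_b)
```

In this sketch the second file costs a single BLOB transfer, because the segment it shares with the first file is already present in the target pool.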
The following example uses these three files and their corresponding sizes:
f.2m - 2 MB
f.4m - 4 MB
g.4m - 4 MB
The maximum segment size in this example is 1 MB. (That size is artificially low for this example only.)
If we look at the snpolicy -report output for the directory containing these files, we see the following:
All three files have the same contents in the first megabyte starting at offset 0. The tag for that BLOB is D03281B0629858844F20BB791A60BD67, and that BLOB is stored only once in the blockpool. The second megabyte is the same for files f.2m and f.4m (tag 12665A8E440FC4EF2B0C28B5D5B28159) but file g.4m has a different BLOB in those bytes. The final 2 megabytes of files f.4m and g.4m are the same.
Remember that the above is an artificial example. In practice, BLOBs do not line up on 1 MB boundaries and are not all the same length.
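The sharing pattern in the three-file example can be reproduced with a short sketch. The file contents below are invented to match the description (one repeated byte value per 1 MB segment), the final 2 MB are assumed to consist of two distinct BLOBs, and the tag function is a stand-in, so the tags it produces will not match the ones shown in the report. Like the example, it uses fixed 1 MB segments rather than realistic variable-length BLOBs.

```python
import hashlib

MB = 1024 * 1024  # artificial fixed segment size, as in the example


def blob_tag(data: bytes) -> str:
    """Return a 128-bit tag as 32 hex digits (BLAKE2b is an assumed stand-in)."""
    return hashlib.blake2b(data, digest_size=16).hexdigest().upper()


# Hypothetical contents: each 1 MB segment is one repeated byte value.
A, B, C, D, X = (bytes([value]) * MB for value in range(5))
files = {
    "f.2m": A + B,
    "f.4m": A + B + C + D,
    "g.4m": A + X + C + D,  # differs from f.4m only in the second megabyte
}

blockpool = {}   # tag -> BLOB data, one entry per unique BLOB
tag_lists = {}   # file name -> ordered list of BLOB tags
for name, content in files.items():
    tags = []
    for offset in range(0, len(content), MB):
        blob = content[offset:offset + MB]
        tag = blob_tag(blob)
        blockpool.setdefault(tag, blob)
        tags.append(tag)
    tag_lists[name] = tags

logical_mb = sum(len(content) for content in files.values()) // MB
stored_mb = sum(len(blob) for blob in blockpool.values()) // MB
print(f"logical: {logical_mb} MB, stored: {stored_mb} MB")
# prints "logical: 10 MB, stored: 5 MB"
```

Under these assumptions, 10 MB of file data occupies 5 MB in the blockpool, and replicating all three files would move only those five BLOBs.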