Deduplication Best Practices

This section describes some best practices related to using the StorNext deduplication feature.

Deduplication and File Size

Deduplication is not beneficial for small files, nor does it benefit files whose content is already compressed (such as MPEG-format video). In general, deduplication is most effective for files that are 64 MB and larger. Deduplicating files smaller than 64 MB may produce suboptimal results.

You can exclude specific files from deduplication by using the dedup_skip policy parameter. Pattern matching for this parameter works the same way as filename expansion in a UNIX shell.

You can also skip files according to size by using the dedup_min_size parameter.
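
The two filters above can be pictured with a minimal Python sketch. It is only an illustration of shell-style pattern matching combined with a size threshold, not the StorNext implementation. The function name should_skip, the example patterns, the use of a list of patterns, and the assumption that sizes are compared in bytes are all hypothetical; consult the policy documentation for the actual syntax and units of dedup_skip and dedup_min_size.

```python
from fnmatch import fnmatchcase

MB = 1024 * 1024

def should_skip(name, size_bytes, skip_patterns, min_size_bytes):
    """Return True if a file would be bypassed under this simplified model.

    skip_patterns  -- shell-style wildcard patterns (stand-in for dedup_skip)
    min_size_bytes -- size threshold (stand-in for dedup_min_size)
    """
    # Pattern check: case-sensitive shell-style matching, e.g. *.mpg.
    if any(fnmatchcase(name, pattern) for pattern in skip_patterns):
        return True
    # Size check: files smaller than the threshold are bypassed.
    return size_bytes < min_size_bytes

# Hypothetical example values, chosen only for illustration.
patterns = ["*.mpg", "*.tmp"]
for name, size in [("clip.mpg", 500 * MB),
                   ("notes.txt", 2 * MB),
                   ("archive.dat", 200 * MB)]:
    print(name, should_skip(name, size, patterns, 64 * MB))
```

In this model, clip.mpg is skipped because of its name, notes.txt is skipped because of its size, and archive.dat is the only file left for deduplication.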

Deduplication and Backups

Backup streams such as tar and NetBackup can be recognized by the deduplication algorithm if the dedup_filter parameter on the policy is set to true.

In this configuration, the content of the backup image is interpreted to find the content files, which are then deduplicated individually. When this flag is not set to true, the backup image is treated as raw data, and the backup metadata in the file interferes with the reduction potential of the deduplication algorithm. A backup stream is recognized by its contents, not by its file name.
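
The idea of recognizing a stream by its contents rather than its name can be sketched in Python with the standard tarfile module. This is only an analogy for the behavior described above and says nothing about how StorNext itself detects backup formats. The file names nightly.bin and fake.tar and the helper make_tar_bytes are hypothetical.

```python
import io
import os
import tarfile
import tempfile

def make_tar_bytes():
    """Build a small tar archive in memory and return its raw bytes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        data = b"example payload"
        info = tarfile.TarInfo(name="payload.txt")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

with tempfile.TemporaryDirectory() as tmp:
    disguised = os.path.join(tmp, "nightly.bin")  # tar content, non-tar name
    impostor = os.path.join(tmp, "fake.tar")      # tar name, non-tar content
    with open(disguised, "wb") as f:
        f.write(make_tar_bytes())
    with open(impostor, "wb") as f:
        f.write(b"not a tar archive" * 100)

    # is_tarfile inspects the file contents and ignores the extension.
    disguised_is_tar = tarfile.is_tarfile(disguised)
    impostor_is_tar = tarfile.is_tarfile(impostor)

print("nightly.bin recognized as tar:", disguised_is_tar)
print("fake.tar recognized as tar:", impostor_is_tar)
```

The file with tar contents is recognized despite its unrelated name, and the file that merely carries a .tar extension is not.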

Deduplication and File Inactivity

Deduplication is performed on a file once it has remained inactive for a period of time after it was last closed. This period is controlled by the dedup_age policy parameter. It is worth tuning this parameter if your workload has regular periods of inactivity on files before they are modified again.

Note: Making the age too small can lead to the same file being deduplicated more than once.
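
The effect described in the note can be illustrated with a rough Python model. It assumes, as a simplification, that a file is deduplicated once each time it stays closed for at least the configured age, and that times are measured in seconds. The function count_dedup_passes and the example schedule are hypothetical and do not reflect the actual scheduler.

```python
def count_dedup_passes(close_times, dedup_age):
    """Count how often a file would be deduplicated in this simplified model.

    close_times -- ascending times (seconds) at which the file was closed
                   after a modification
    dedup_age   -- idle time (seconds) required before deduplication
    """
    if not close_times:
        return 0
    passes = 0
    # A pass happens whenever the idle gap before the next close
    # is at least as long as the configured age.
    for current, following in zip(close_times, close_times[1:]):
        if following - current >= dedup_age:
            passes += 1
    # After the final close the file eventually ages out and is processed.
    return passes + 1

# Hypothetical workload: the file is rewritten and closed every 10 minutes.
closes = [0, 600, 1200, 1800]
print("age 300 s:", count_dedup_passes(closes, 300))
print("age 900 s:", count_dedup_passes(closes, 900))
```

With an age shorter than the rewrite interval, every intermediate version is deduplicated. With an age longer than the interval, only the final version is processed.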

Deduplication and System Resources

Running deduplication is a CPU- and memory-intensive operation, and the backing store for deduplicated data can see a lot of random I/O, especially when retrieving truncated files.

Plan accordingly: do not under-resource the blockpool file system or the metadata system if you are striving for optimal performance.

Deduplication Parallel Streams

The number of parallel deduplication streams is controlled by the ingest_threads parameter in /usr/cvfs/config/snpolicyd.conf.

If you are not I/O limited and have more CPU power available, increasing the stream count from the default value of 8 streams can improve throughput.