Best Practices for Object Synchronization

Whether capturing images from satellites, creating the next box office thriller, or ingesting massive amounts of unstructured data from next-gen sequencers, StorNext users can easily share data and collaborate on a global scale. While application workflows are unique, some critical requirements never change: data is invaluable, data will continue to grow at unprecedented rates, and, more than ever before, data must be easily shareable to support geographically dispersed teams.
Designed to support data-intensive workflows, FlexSync object synchronization is a simple but powerful user interface built to take full advantage of S3 storage from Amazon Web Services (AWS S3) or ActiveScale S3 to easily create one or many shareable workspaces. FlexSync clients use S3 to move content from their local system to a remote workspace.
A FlexSync workspace, also known as a repository, is simple to create and manage. Whether you're in the same location or at a distant site, use FlexSync object synchronization to browse and download content from the remote workspace to your local system. Updated and newly created content can be moved (committed) to the workspace, allowing other users to access the content for continued collaboration. No matter where you might be located, FlexSync object synchronization allows you to share projects or vast archives. Because all of the content within the repository is stored as S3 objects, the content is immutable (unchangeable). You can create and use as many repositories and workspaces as needed.
The FlexSync Object Synchronization Environment
FlexSync object synchronization supports public cloud storage with Amazon Web Services (AWS) and Quantum's ActiveScale object storage platform. This cloud-based storage enables geographically distributed teams to share data in a single archive repository.
You can access the shared S3-based repository to store, browse, and pull data to a local directory or another designated location. You have easy access to the content you need — anywhere and anytime.

The following restrictions apply in FlexSync 3.2.1 when you perform an S3 object synchronization task.
- Managed relation points: If your file system contains a managed relation point, you can check out a repository to a file system that is designated as managed; however, you cannot check out to a directory (or below) where a relation point is configured. For example, suppose you designate /stornext/snfs1 as a managed file system per the file system configuration, then create a sub-directory labeled tape_policy and add a Storage Manager relation point to it. You can check out a repository to /stornext/snfs1, but you cannot check out a repository to /stornext/snfs1/tape_policy (or below).
- FlexSync 3.2.1 does not support a non-StorNext file system for an S3 object-based synchronization task.
- FlexSync 3.2.1 does not support Apple named streams for an S3 object-based synchronization task.
- FlexSync 3.2.1 only supports Amazon Web Services (AWS) S3 and ActiveScale S3 storage destinations.


The maximum supported length of a directory path is 4,096 characters; limit your directory paths to 4,096 characters or less.
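To find paths that exceed this limit, you can scan your file system; the following is a minimal sketch using GNU find and awk (the path is illustrative):
Example
# find /stornext/snfs1 -print | awk 'length > 4096'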

Caution: File names that are NOT UTF-8 compliant are NOT synchronized. A file name might not be UTF-8 compliant because, for example, it originated from a file system that allows non-UTF-8 characters (such as Latin-1 characters). To scan your file system for invalid file name characters, use the snfsnamescanner -u command (see snfsnamescanner in the StorNext 6 Man Pages Reference Guide). To convert the invalid file names to UTF-8, run the script (utf8FileNames.sh) generated by the snfsnamescanner command.
Example
# /usr/cvfs/lib/snfsnamescanner -u /stornext/test
Wed Feb 26 08:15:11 2020
Starting search in: /stornext/test
Scanning for:
invalid UTF8 names
Files/directories scanned: 1
Elapsed time: 00:00:00
0: File names with invalid UTF8 results in ./utf8FileNames.sh
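If the scan reports non-compliant file names, review the generated script and then run it to convert the names; for example:
Example
# sh ./utf8FileNames.sh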

You can only use a StorNext file system when you perform an object synchronization task.

The maximum supported file size for an object synchronization task to an AWS S3 repository is five terabytes (5 TB).
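To identify files that exceed this limit before you commit, you can scan your working directory; the following is a minimal sketch using GNU find (the path is illustrative, and +5120G approximates 5 TB in find's binary units):
Example
# find /stornext/snfs1 -type f -size +5120G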

FlexSync version 3.0 (or later) supports deduplication for object synchronization tasks, and the feature is enabled by default.
How does object-based deduplication work?
The deduplication process consists of two phases:
- Phase 1: The deduplication process scans your local working directory and generates a checksum for each file. If the process identifies duplicate files with identical checksums, the files are deduplicated and only one copy of the file is uploaded to your S3 repository. The process also creates an object in your S3 repository for each file.
- Phase 2: The deduplication process then compares the checksum of each file scheduled for upload against the checksum of the corresponding object in your S3 repository. If the checksums are identical, the process generates the equivalent of a hardlink to the existing copy of the file in your local working directory (to prevent a new object from being generated in your S3 repository).
Note: The hardlink is a separate object, and the deduplication comparison against your S3 repository completes before an object is uploaded during the current commit process.
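For illustration only (this is not the FlexSync implementation, and the path is illustrative), the following sketch shows how checksum-based duplicate detection works: files are grouped by SHA-256 checksum, and any checksum that appears more than once marks duplicate content that needs to be uploaded only once.
Example
# find /stornext/test -type f -exec sha256sum {} + | sort | uniq -w 64 --all-repeated=separate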
Deduplication Considerations
- If you perform a deduplication process against every file before you commit to your S3 repository, the process might require a large amount of time.
- If your system contains a large number of duplicate files, enable the deduplication option to save space and time when you transfer your data to your S3 repository.
- If your system contains a small number of duplicate files, the deduplication process spends unnecessary time pre-processing the files. In this case, disable the deduplication option.

Your system clock must be synchronized with the clock kept by the S3 object repository. If the clocks differ by too large a margin, FlexSync S3 object synchronization requests fail.
If you plan to perform an S3 object synchronization task, then Quantum recommends you configure NTP (or equivalent) on your system(s).
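For example, on a Linux system that uses chrony (or systemd's timesyncd), you can confirm that the clock is synchronized before you run a task:
Example
# chronyc tracking
# timedatectl status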

The following data is stored to your S3 object repository when you perform an S3 object synchronization task (see the example after the lists below):
- executable bit
- extended attributes (xattrs)
- file type and mode/permissions (as returned by a stat syscall)
- ftype: file, node, symlink
- gid number
- mime_type (for example, text/plain)
- modification time
- sha256 sum
- size
- uid number
The following data is not stored to your S3 object repository when you perform an S3 object synchronization task:
- access time
- ACLs
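For example, you can display the stat fields that are preserved (file type, mode, uid, gid, size, and modification time) for a file with GNU stat (the file path is illustrative):
Example
# stat -c 'type=%F mode=%a uid=%u gid=%g size=%s mtime=%Y' /stornext/test/file.dat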

- When you configure tasks, only one task can write to a given destination directory. In other words, you cannot run multiple tasks against the same directory structure concurrently.

Optimal overall performance and maximum throughput can be achieved by following these best practices:
- Limit the number of large-memory and compute-intensive applications that run while FlexSync tasks are executing.
- For systems where FlexSync is configured to run on MDCs, if the primary node (node 1) fails over to node 2, Quantum recommends that you fail back to node 1 as quickly as possible, because this distributes the compute and I/O processes across both nodes.
WARNING: While running in a failover scenario, with all compute and storage processing running on a single node, applications running within the Dynamic Application Environment (DAE) have degraded performance. If you can tolerate the degraded performance, you can continue to run FlexSync on node 2 until you are able to fail back to node 1 as the primary node.
- Tuning the host TCP stack might be necessary to optimize network throughput for FlexSync tasks, but should be done with care, because these changes affect all network activity and resources. For example (to apply these settings, see the commands after this list):
net.core.wmem_max = 268435456
net.core.rmem_max = 268435456
net.ipv4.tcp_rmem = 4096 65536 268435456
net.ipv4.tcp_wmem = 4096 65536 268435456
net.ipv4.tcp_window_scaling = 1
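For example, to apply a single setting immediately, or to load the full set persistently from a file you create under /etc/sysctl.d/:
Example
# sysctl -w net.core.rmem_max=268435456
# sysctl --system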

If your system generates temporary files that you do not want to synchronize to your repository, you can configure the cool down option (see Cool Down).
Alternatively, you can configure an exclusion (see Exclusions).

If your check-out, update, or commit process contains a large number of files and/or directories, your system consumes a large amount of memory, and the amount of time required to perform the action(s) against your S3 repository increases significantly.
Quantum recommends you limit the number of files and/or directories in a check-out, update, or commit process to a maximum of ten million (a lower number of files and/or directories improves performance).
For example, the amount of memory required for ten million files, with one million files per directory, can be greater than 30 GB.
If you configure multiple automated tasks (that contain a large number of files) to run concurrently, Quantum recommends you schedule the tasks at different times. Otherwise, the amount of memory required is multiplied by the number of tasks running concurrently.
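For example, to count the files and directories below a working directory before you start a large check-out, update, or commit (the path is illustrative):
Example
# find /stornext/test | wc -l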