Coherency for Workflows in NAS and SNFS Environments

Applications in a computing environment produce, consume, or modify information, and application developers have chosen many different ways to organize that information. Databases, video applications, and a myriad of others each handle their data differently: some store all the information in one file, while others represent their information with many files and folders/directories.
Most applications run on a single computer or virtual machine with a specific operating system. That operating system, such as Linux, Windows, or Mac OS, has a file system on which to do all this work. Each operating system has specific techniques in place to make access to the data quick, easy, and consistent. Many applications also consist of separate processes that run concurrently on the same machine. Operating systems provide certain guarantees that data written by one process is available to the other processes, and they provide mechanisms, such as lock files and exclusive-access mechanisms, so that different processes can synchronize their activity and prevent uncoordinated access.
Operating systems and file systems provide caching mechanisms to speed up accesses. These mechanisms allow for faster data retrieval and data locality. In addition, caching is used for meta-data such as file attributes and directory contents, which are kept in memory so that unchanged data does not have to be re-read from slower storage.

There are many workflows. One example is a DPX video stream, where each frame is stored in its own file and the audio is in a separate file, along with a small HTML file containing information about the video stream. There can be many thousands of frames, and all of this content is typically in one directory.
Other video applications store all of the frames in a single large file, tens of gigabytes or more in size; the audio can be embedded as well.
A database typically has control files, data files, and logs that are used in the event of a failure. These can all be in the same folder.
Many other workflows are organized with anywhere from a single file to many files and folders/directories.

There are many good reasons to access workflows from many machines. This is done with SMB/CIFS, NFS, or other access methods such as the StorNext File System (SNFS). Each mechanism allows a different machine and possibly a different operating system to see the files and folders that were created on another machine. This is a powerful capability, but it creates some interesting challenges.
When one machine writes data, when is that data available on another machine? When a file is created on one machine, when will it appear on another machine? If many directories and files are all modified in a workflow, when do we know that all the modifications are finished and available through another access method, for example NFS, SMB, or SNFS? The answers to these questions are part of the coherency dilemma.
As was mentioned earlier, each operating system works hard to provide high performance, so data and meta-data are cached in local memory. Clustered environments must also provide mechanisms so that good performance can be realized; therefore, they cache as close as possible to the application and attempt to minimize communication. Techniques like read-ahead and directory/folder caches are common, but they create challenges for the coherency seen on different nodes in the cluster. The answers to the questions above depend on how each access method invalidates caches and how races between concurrent accesses are handled.
Note: When mixing protocols such as SMB and SNFS, when you create or delete files and directories using one protocol, another protocol can take a while to see the changes and there may be a need to refresh. For example, if you create a file on an SNFS client, you might need to refresh Finder or Explorer on an SMB client to see the file.
In general, uncoordinated writers to the same file on different SNFS clients are not going to be coherent. If the writers (append or otherwise) do not coordinate who is writing with locks or some other mechanism, we cannot expect coherency.
With Linux SNFS (and perhaps other operating systems besides Linux), coherency might occur accidentally, but in general this is not a suggested or "supported" workflow.
The following might cause uncoordinated appending writers to corrupt data:
- Windows clients
- Switching the model to ioTokens=false, which happens if a Mac client or pre-StorNext 6.0 client is one of the writers
- NFS access
- Possibly SMB
Supported workflows are producer/consumer models or workflows with coordination where multiple writers use locks or some other mechanism to protect against concurrent writes of the same part of the file, including appending.
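As an illustration of the coordination the preceding paragraph describes, here is a minimal C sketch in which each cooperating writer takes a POSIX advisory whole-file write lock before appending and releases it afterward. The file path, record format, and minimal error handling are illustrative assumptions, not part of StorNext, and advisory locks only help if every access method to the file honors them (cluster-wide lock support requires fileLocks=TRUE in the SNFS configuration, as discussed later).

```c
/* Hedged sketch: one way cooperating writers on different clients could
 * serialize appends with a POSIX advisory whole-file lock.  The path and
 * record contents are hypothetical; error handling is minimal. */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int append_record(const char *path, const char *rec)
{
    int fd = open(path, O_WRONLY | O_APPEND | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    struct flock lk = {0};
    lk.l_type   = F_WRLCK;      /* exclusive lock                  */
    lk.l_whence = SEEK_SET;
    lk.l_start  = 0;
    lk.l_len    = 0;            /* 0 = lock the whole file         */

    if (fcntl(fd, F_SETLKW, &lk) < 0) {   /* wait for the lock     */
        close(fd);
        return -1;
    }

    ssize_t n = write(fd, rec, strlen(rec));  /* append while locked */

    lk.l_type = F_UNLCK;                  /* release the lock        */
    fcntl(fd, F_SETLK, &lk);
    close(fd);
    return n < 0 ? -1 : 0;
}
```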

NFS has been touted as a stateless file system. This has advantages but limits coherency. For performance reasons, NFS caches file/directory attributes and contents. This can be thought of as state. For example, the state of a file includes who owns it and who can write it. The state of a directory includes which files exist in that directory.
NFS has no internal mechanism for locking files or directories to protect cached state. Instead, NFS expires its directory caches and file attributes after a timeout. For example, if one lists the files in a directory, the contents of that directory are cached on that NFS client. If another client then creates a file in that directory, a client that is continually checking for new files will typically not see the new file for several seconds. This timeout can be controlled with mount options. The same is true for file attributes, including the size: if one client continually checks the size of a 10-byte file while another client appends 100 bytes to it, the size reported on the first client will not change for several seconds.
As a result of this lack of consistency across NFS clients, NFS is said to have weak cache consistency or coherency.
Appending to the same file concurrently from different NFS clients will likely corrupt the data. This is not supported on NFS. See https://unix.stackexchange.com/questions/299627/multiple-appenders-writing-to-the-same-file-on-nfs-share for details.
There have been efforts to tighten NFS consistency. For example, if one NFS client checks the size of a file after it is written on another client, the local attribute cache continues to show a stale size for several seconds; but if that NFS client opens and reads the file, the consistency issue is remedied more quickly. This behavior is part of what is called NFS close-to-open consistency.
...when an application opens a file stored in NFS, the NFS client checks that it still exists on the server, and is permitted to the opener, by sending a GETATTR or ACCESS operation. When the application closes the file, the NFS client writes back any pending changes to the file so that the next opener can view the changes. This also gives the NFS client an opportunity to report any server write errors to the application via the return code from close(). This behavior is referred to as close-to-open cache consistency.
See http://nfs.sourceforge.net/, section A8, "What is close-to-open cache consistency?", for a more detailed explanation and several caveats. It also explains the noac (no attribute cache) mount option and how to avoid the data cache with the O_DIRECT open flag.
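As a rough illustration of the O_DIRECT approach mentioned above, the sketch below opens a file with O_DIRECT so the read bypasses the NFS client's data cache. This is a minimal example assuming Linux semantics: O_DIRECT typically requires the buffer (and often the offset and length) to be aligned, and the 4096-byte alignment and read size here are illustrative choices, not values taken from the FAQ.

```c
/* Hedged sketch: reading with the data-cache bypass described above.
 * O_DIRECT generally requires aligned buffers; 4096 bytes is an assumed
 * alignment that suits most Linux configurations. */
#define _GNU_SOURCE            /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

ssize_t read_uncached(const char *path, void **out)
{
    void *buf = NULL;
    if (posix_memalign(&buf, 4096, 4096) != 0)
        return -1;

    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) {
        free(buf);
        return -1;
    }

    ssize_t n = read(fd, buf, 4096);   /* bypasses the client data cache */
    close(fd);
    if (n < 0) {
        free(buf);
        return -1;
    }
    *out = buf;                        /* caller frees the buffer */
    return n;
}
```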
Note this comment from Sourceforge:
There are still opportunities for a client's data cache to contain stale data. The NFS version 3 protocol introduced "weak cache consistency" (also known as WCC) which provides a way of checking a file's attributes before and after an operation to allow a client to identify changes that could have been made by other clients. Unfortunately, when a client is using many concurrent operations that update the same file at the same time, it is impossible to tell whether it was that client's updates or some other client's updates that changed the file.
For this reason, some versions of the Linux 2.6 NFS client abandon WCC checking entirely, and simply trust their own data cache. On these versions, the client can maintain a cache full of stale file data if a file is opened for write. In this case, using file locking is the best way to ensure that all clients see the latest version of a file's data.
That said, file locking, in a diverse environment with SMB and NFS or with modifications beneath the NFS server directly on the underlying file system, does not ensure anything unless all access methods to the data honor those locks.
For NFS and POSIX locks, see POSIX Advisory Byte Range Locks.

SMB/CIFS has better consistency than NFS in some cases because it uses Opportunistic Locks (oplocks). But, implementations of SMB on different operating systems continue to cache directory contents and attributes without any locks. The oplocks are used, typically, so that a client can cache data before sending it to the server or for re-use. If any other application opens the file when an oplock is held, that oplock is broken and the data and associated attributes are flushed and discarded before the other application’s open returns. This provides some degree of coherency.
Directory/folder contents are cached and used without refresh for several seconds; the exact behavior is up to each operating system's implementation. Testing on Windows and Mac OS showed that repeated directory listings do not see a file created on another client for several seconds (5 to 10 seconds in some of our testing). File attributes are cached in a similar fashion.
Note: Apple macOS does not use SMB oplocks. Instead, the Mac performs write-through caching. Testing has shown that multiple writes are gathered before a larger write is sent to the SMB server, and this is done without any locking: a program sequentially writing a 2 MB file with 4 KB writes generates 512 KB writes to the SMB server. If a small file is read over and over on the Mac, each read results in a read to the SMB server, so no read data cache appears to be active, which makes sense since no oplocks are obtained.
SMB has additional capabilities that extend Windows directory notifications, so applications can register for events on a directory. Windows Explorer uses this so that it is notified when another client changes the directory. If one has a Windows Explorer window open on a directory and then creates a file on another client, the file name is seen quickly, independent of the directory cache in SMB.
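The directory-notification registration referred to above is exposed on Windows through the Win32 change-notification APIs. The sketch below is a minimal, hedged example of how an application might watch a directory for file create/delete/rename events; the watched path and the filter choice are assumptions for illustration and are not specific to SMB or StorNext.

```c
/* Hedged sketch: registering for directory change events on Windows with
 * the Win32 change-notification API (the mechanism Explorer-style refresh
 * relies on).  The watched path is hypothetical. */
#include <windows.h>
#include <stdio.h>

int watch_directory(const wchar_t *dir)
{
    HANDLE h = FindFirstChangeNotificationW(
        dir,
        FALSE,                               /* do not watch the subtree */
        FILE_NOTIFY_CHANGE_FILE_NAME);       /* create/delete/rename     */
    if (h == INVALID_HANDLE_VALUE)
        return -1;

    for (;;) {
        if (WaitForSingleObject(h, INFINITE) != WAIT_OBJECT_0)
            break;
        wprintf(L"change detected in %ls\n", dir);
        if (!FindNextChangeNotification(h))  /* re-arm for the next event */
            break;
    }
    FindCloseChangeNotification(h);
    return 0;
}
```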

SNFS regularly caches data and file attributes on each client as files and directories are referenced and used. Typically, SNFS does not use locking to guarantee coherent data and attributes. For example, if a user accesses a file, the attributes are retrieved from the server and saved on the client without any locks. If another client changes the file’s attributes, it typically does this without obtaining any lock. However, when the change occurs on the server, a notification goes out to all clients that have the file/directory referenced. The notification contains the new values of any changed attributes. This change is updated on all clients, so their cached copy is “fixed.” This operation is called a NodeChange. The next time the file is accessed on each client, the correct attributes are used.
Note that with NodeChanges, there is a slight timing issue. All attribute changes occur by sending a message to the server to perform the change. For example, if the ownership of a file changes, the invoking client sends an RPC to the server. The NodeChange work is handed off to another thread in the server and the response is then returned to the invoking client. As the response to the invoking client is being built on the server, another thread issues the NodeChange to all clients that have that file/directory referenced. If the system is extremely busy, those NodeChange messages could take some time to reach all the clients. In nearly all cases, the attributes are correct when one does the following operations:
- Client 1: reference object A
- Client 2: change object A
- Client 2: tell client 1 to go
- Client 1: check object A
With directory contents, there is also a cache on each client. The directory contents are invalidated if the local client changes the directory or if a NodeChange arrives for that directory with a new modify time. Whenever a directory has a file created, removed, renamed, and so on, a NodeChange is issued to all clients with a reference to that directory.
There is an enhancement for directories: if too many clients have a directory referenced, the NodeChange indicates that the directory should be marked STALE and refreshed when it is used again. This prevents a flood of NodeChanges. The mechanism is a bit more complex than this, but suffice it to say that NodeChange messages are reduced greatly for heavily used directories in the cluster.

SNFS has two coherency models for file data, called the SHARED_WRITE model and the I/O Token model. See StorNext File System Data Coherence for a high-level description.

In the SHARED_WRITE model, when a file is shared and all openers are read-only, the file is in the SHARED state. In this state, file data is cached on a client and no writes are allowed. If there are multiple openers and at least one client has the file open for write, the file enters the SHARED_WRITE state. In this state, no data can be cached, so all I/O is DMA, that is, directly to/from storage. In addition, file attributes are sent quickly as they are updated, e.g., the file size and times at the end of each write.
In this model, the only revoke issued is when a file is open once for write. This is called the EXCLUSIVE state, and the client holding this state can buffer dirty writes and hold attribute updates for some time. That state is revoked when any other open occurs, and the state then becomes SHARED_WRITE. If a file is open for read-only, the clients are all sent an asynchronous message to enter SHARED_WRITE and discard all buffers.
The SHARED_WRITE model is the only model available before StorNext 6 and the only model available on Xsan, since Xsan has not yet upgraded to StorNext 6.
This model provides high performance for distributed applications that need concurrent I/O and are well behaved. For details see PAGESIZE restrictions.

With the I/O Token model, if all openers are read-only, there is little change in operation from the SHARED_WRITE model: all clients are able to cache data and do I/O concurrently on separate SNFS clients. There are some differences in the frequency of attribute updates, and deferred close is allowed, whereas in the SHARED_WRITE model an application close closes the file immediately, which can hurt performance in some workflows.
If any client opens the file for write, a token is required on any client to perform I/O or other operations such as truncate. This serializes I/O across the clients in the cluster by requiring each client to obtain the token, which helps tremendously with coherency: with multiple writers, one client cannot overwrite another client's write, as can occur in the SHARED_WRITE model.
In addition, when the token is obtained, file attributes are refreshed in the token RPC, and the NodeChange mechanism is not used while the client holds the token.

If the I/O Token model is being used for a file that is known to be open by multiple clients, a get-attribute call will always obtain the token (if it is not already held), thereby refreshing the attributes. This helps when interacting with NFS, since NFS relies so heavily on attributes to determine consistency.
In either model, when the MDC receives a lookup request (get attributes) on a file and the size is 0, the MDC issues a flush call to any client holding either the token or exclusivity in the SHARED_WRITE model. This handles a common case where one client produces many small files that are all cached on that client and then notifies another client that the files are there. The files' dirty data and attributes could still be in the producing client's cache. Lookups that see the files with a size of 0 will force coherency before the initial attributes are returned to the consuming client.
When fileLocks=TRUE is set in the SNFS file system configuration, thereby enabling cluster-wide support for POSIX and Mandatory Byte Range Locks, more coherency functionality is enabled. If the I/O Token model is being used for the given file and a lock is obtained, the token is also obtained to force any dirty data or meta-data to be flushed on other clients. This prevents the SNFS client from holding stale meta-data or data for the file being locked as the file lock is granted.

In this section, the NFS server is running on one or more SNFS clients exporting the file system that is also visible directly on other native SNFS clients.
NFS and SNFS are more tightly integrated than NFS and SMB. Years of development have gone into improving coherency and functionality when NFS runs over SNFS. However, applications that need tight coherency between distinct nodes should realize that NFS only provides "weak cache consistency" (WCC). See Native SNFS – Coherency of File Data for details. SNFS cannot improve on the NFS access method's limited guarantees.
As mentioned in Native SNFS – Coherency of File Data, a get attribute call will make attributes coherent within the NFS server running over an SNFS client, if the I/O Token model is in use on all SNFS clients accessing the same file and there are multiple opens on that file.
In basic producer-consumer tests, the combination of these two access methods (NFS and SNFS) provides WCC, and applications usually run fairly well. However, very quick access on an NFS client to data and meta-data produced on other NFS or SNFS clients should not be expected to always work, especially with network interruptions.

This section covers SNFS clients accessing data concurrently with Microsoft Windows or Apple macOS clients accessing the same data using SMB. The SMB server(s) (samba) are running on StorNext NAS machines that are exporting the same file system, so they are in fact SNFS clients.
As described above, SMB uses oplocks to cache data. When a native SNFS client opens a file or performs one of many other operations, an FsmOpenChange message is sent to the SNFS client under the samba server. That message causes a conflicting oplock to be broken. The SNFS client causing the break waits up to 45 seconds for the break to finish before the operation (for example, open()) is allowed to complete back on the native SNFS client. The break usually completes on the order of a millisecond but can take longer if the SMB client has much data to flush or is too busy.
This integration between SMB and SNFS was introduced in StorNext 6.2 with Appliance Controller 2.2.0. It allows for more coherent access to data between SNFS clients and SMB on Windows. Recall from the SMB discussion above that Mac clients do not use oplocks.
Recall that SMB typically caches directory contents and attributes, so these can be stale. Native SNFS clients also cache, but with the NodeChange mechanism described above, their caches are kept very coherent. SMB clients can be configured to not use these caches, but SMB applications then see reduced performance.
SMB also supports Directory Notifications, and notifications work correctly between SMB clients sharing the same folders. However, as of StorNext 6.2 with Appliance Controller 2.2.0, if a native StorNext client modifies a directory, a notification is not generated to an SMB client connected to StorNext NAS. The anticipation is that, in a subsequent release, a NodeChange operation will communicate with the samba/ctdb server and cause the Notification event.

In this section, the NFS and SMB clients are each using the same SNFS file system and directories. Distributed workflows access files and directories using NFS and SMB concurrently. The NFS clients use an NFS server and the SMB clients use an SMB or samba server. In the case of StorNext SNFS, the NFS and SMB servers could be on distinct SNFS clients.
Many operations work well but there are some interesting challenges and surprises.
As mentioned above, SMB uses oplocks to cache data (on Windows but not on macOS). Let us start with a file that is open and being written on a Windows SMB client; the SMB server has granted an oplock to that SMB client. If an NFS client starts looking at the file, say using /bin/ls -l <filename>, it will cache the attributes and will see the file with a certain size. As the file continues to grow, subsequent ls -l <filename> commands will not see the file size or the modify time updated for seconds. The ls command does not break the oplock, so data and attributes continue to be cached on the NFS client, which therefore does not see any deferred writes or delayed size updates from activity on the Windows SMB client. NFS keeps its own attributes and directory contents cached, refreshing every so often.
A Linux NFS client that has already read the file can have the data and attributes cached, too. A re-read of the file simply re-obtains the attributes and, if there is no significant change, uses the cached contents. This NFS operation (get attributes) does not check or break an oplock on the NFS client or server. So, it is certainly possible for a re-read to miss previously written contents, since the oplock is not broken on a re-read. But normally an SMB write quickly updates the size and/or modify time of the file, and this invalidates the cached state on the NFS client, since the attributes are checked at the start of a re-read. If the start of the read sees a significant change in the attributes, the client discards the cache and re-reads the file from the server. That operation (NFS read) breaks the oplock.
If a Linux NFS client attempts to append to a file that has an outstanding oplock, the open on the NFS client does not break the oplock. The write is built with attributes (potentially a stale size) and sent to the server lazily. The final close, though, will flush all the data and attributes on an NFS client. When the first write arrives at the NFS server, a CentOS 7.5 version of the StorNext product (Xcellis base) will issue an internal nfsd_open, and this results in an oplock break.
In general, we can see that NFS does not keep files open, so it does not play very well with oplock semantics. Oplocks with samba are implemented on the server with leases in the Linux kernel. A lease request is refused if a file is already open in a conflicting mode (a read open conflicts with a write oplock but not with a read oplock). But NFS does not keep files open. Also, an existing lease is broken by an open(2) or truncate(2), and NFS defers or completely avoids opens in the CentOS 6.10 kernels that run on Artico. However, the NFS server code explicitly breaks the oplock if it is doing I/O. In the 7.5 Linux kernels that run on Xcellis, truncate(1) operations arriving at the NFS server do NOT break oplocks either; they do in 6.10 kernels but not in 7.5.
Another significant issue for coherency is that Apple macOS smbfs does not use oplocks. It aggressively writes to the server but caches state such as file attributes and directory contents, which presents more coherency challenges. Tests have shown that re-reads always seem to go to the SMB server.
All that said, access using SMB and NFS is usually coherent. Applications should be aware that appending writes should be avoided for shared files when using NFS. Tight races between applications on separate nodes should also be avoided in this mixed environment. Quantum's internal "prodcon" test runs without errors between SMB on Windows and NFS on Linux, but delays had to be introduced to accommodate the stale directory caches and other stale cached state.
When running SMB on Mac and NFS or SMB on another client, the prodcon test also passed, but only after some modifications: a 5-second delay was introduced after the producer completed. However, when run on Mac High Sierra, directories would sometimes remain stuck appearing empty, even many minutes after files had been created on another node. The workaround was to create and remove a file in that directory whenever an empty directory was encountered on the consuming client, which caused the SMB client's cached "empty directory" state to be discarded. The Apple macOS Mojave release does not have this problem.
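For reference, here is a minimal C sketch of the workaround just described: when the consuming client sees an apparently empty directory, it creates and removes a scratch file there so the SMB client discards its cached "empty directory" state. The scratch file name and the absence of retry logic are illustrative assumptions.

```c
/* Hedged sketch of the workaround described above: poke a directory by
 * creating and removing a scratch file so the SMB client drops its cached
 * "empty directory" state.  The scratch file name is hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

void poke_directory(const char *dir)
{
    char path[4096];
    snprintf(path, sizeof(path), "%s/.cache_poke_tmp", dir);

    int fd = open(path, O_CREAT | O_WRONLY | O_EXCL, 0600);
    if (fd >= 0) {
        close(fd);
        unlink(path);   /* removing it invalidates the cached listing */
    }
}
```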
What if one wants to use file locking over SMB and NFS? This work is not yet completed as of this StorNext release. For now, POSIX advisory locks work between NFS clients and native SNFS clients or other NFS clients; they do not work at all between NFS and SMB. SMB supports Windows Mandatory locks, but they are not yet sufficiently integrated through samba into the StorNext file system. Windows Mandatory locks work between clients sharing files with SMB. They also work between native SNFS Windows clients, but not between SMB and native SNFS Windows clients.

A Re-share SNFS Windows client is a Windows system mapped to a share exported by a native StorNext Windows SAN/LAN client from its SNFS file system. In this case, the StorNext NAS software and hardware is not being used, since the NAS offering uses Linux on its SMB servers; more development is needed to integrate these systems with StorNext NAS. Customers do this because an all-Windows environment eliminates some of the challenges introduced by the samba implementation in the StorNext NAS offering. Whatever native SNFS Windows clients see will function as expected on the SMB/CIFS Windows clients. This includes handling of Windows Directory Notifications (even when another non-Windows SNFS client modifies the directory) and Windows Sharemodes (even with other SNFS Windows clients; see the GlobalShareMode setting).

POSIX Advisory Byte Range Locks provide cooperating applications the ability to lock a file’s byte range, wait for that range to be unlocked, and check to see if there is an existing lock. These locks do not prevent applications from modifying the file or deleting it.
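The three operations described above map directly onto fcntl(2) byte-range locking. The following minimal C sketch checks for an existing lock with F_GETLK and then acquires the lock with F_SETLKW, waiting if the range is busy; the 512-byte range is an arbitrary illustrative choice.

```c
/* Hedged sketch of the operations described above: test for an existing
 * lock (F_GETLK), then acquire it, waiting if busy (F_SETLKW).
 * The byte range (first 512 bytes) is hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int lock_range(int fd)
{
    struct flock lk = {0};
    lk.l_type   = F_WRLCK;
    lk.l_whence = SEEK_SET;
    lk.l_start  = 0;
    lk.l_len    = 512;

    /* Check whether someone already holds a conflicting lock. */
    struct flock probe = lk;
    if (fcntl(fd, F_GETLK, &probe) == 0 && probe.l_type != F_UNLCK)
        printf("range is currently locked by pid %d\n", (int)probe.l_pid);

    /* Acquire the lock, sleeping until the range is unlocked. */
    if (fcntl(fd, F_SETLKW, &lk) < 0)
        return -1;

    /* ... operate on the locked range ... */

    lk.l_type = F_UNLCK;                  /* release the lock */
    return fcntl(fd, F_SETLK, &lk);
}
```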
A significant addition in this release is the start of the integration of POSIX lock functionality between NFS and SNFS. If an application holds a POSIX lock on an NFS client and the NFS server (exporting an SNFS file system) restarts or is failed over to another NFS server, the lock state moves to the new server. Prior to this release, the lock state would be lost and not visible on the new NFS server or within SNFS; when the state was lost, a new request for the lock would be incorrectly granted. This enhancement required changes to the NFS server, SNFS, and the supporting operating system and is therefore only supported on Appliance Controller 2.2.0.
NFS clients detect a server state change and issue a lock reclaim. The lock reclaim is converted into an internal StorNext lock recovery action thereby restoring the lock to the correct state within SNFS. These locks can be seen on the MDC with the cvadmin(8) command, repfl.
If an NFS client requests a "set or wait if busy" lock (F_SETLKW), NFS issues an F_SETLK with a wait flag to the NFS server using the Network Lock Manager (NLM) protocol implemented by lockd(8). The NFS server then requests the lock from the underlying SNFS file system with an F_SETLK, which goes to the SNFS server. So, there are four places that can have lock state:
- The NFS client (actually lockd): once a lock is obtained, it is remembered so it can be re-sent as a reclaim if a failure occurs between the NFS client and the NFS server.
- The NFS server (actually lockd): once a lock is granted, the NFS server maintains the lock state so it can release that state in the event of an NFS server failure or NFS client disconnect.
- The SNFS client underneath the NFS server: the state is maintained in the SNFS client so that it can be reconstructed if the SNFS MDC restarts or the network connection between the SNFS client and MDC is re-established.
- The SNFS MDC: this is the core lock state across the cluster. When another client attempts to get the lock, this is where the decision is made to grant or refuse it.
When the NFS client receives the F_SETLKW, it issues an NLM lock request to get the lock, with an indication that it is a "wait" lock. If the NLM request comes back BUSY from the NFS server, the NFS client waits. There is no thread waiting on the NFS server, but the lockd implementation records that an NFS client is waiting for this lock. When the lock state on the NFS server is released, lockd is notified and an NLM GRANT callback is invoked to notify the NFS client.
With native SNFS, if an application issues an F_SETLKW call, it waits within the SNFS client, and there is an asynchronous notification from the SNFS MDC to the SNFS client when the lock is released. But the NFS server (actually lockd) over SNFS does not issue an F_SETLKW to SNFS, so there is no thread waiting to receive the notification when the lock is released. If SNFS is underneath lockd, the lockd implementation is not invoked by SNFS to record that an NFS client is waiting for the lock, so no NLM GRANT request is sent to the NFS client.
More work is needed with StorNext NAS to issue the NLM GRANT callback to the NFS client. Fortunately, the NFS client retries the lock every so often, and the lock will be granted with a maximum measured delay of 30 seconds with a Linux NFS client. There are plans to integrate the SNFS client and the NFS server.
Note that Windows Locks and SMB oplocks are very different and do not interact with POSIX locks. There are plans to integrate Windows Mandatory Locks and POSIX locks in a later release.
On Mac OS, for locks to be reclaimed correctly, the file /etc/nfs.conf must be modified to add nfs.statd.send_using_tcp = 1.
POSIX Advisory Integrated by AMs | NN3C | NSMC | LNAC | Xsan | WNAC | WRC
---|---|---|---|---|---|---
NAS NFS V3 Client (NN3C) | Yes | No | Yes | Yes | No | No
NAS SMB Client (NSMC) | No | No | No | No | No | No
Linux Native LAN/SAN Client (LNAC) | Yes | No | Yes | Yes | No | No
Xsan Client (Xsan) | Yes | No | Yes | Yes | No | No
Windows Native SAN/LAN (WNAC) | No | No | No | No | No | No
Windows “Re-share” Client (WRC) | No | No | No | No | No | No

When a file is created or opened on Windows, the caller can specify sharemodes for other processes. This mechanism controls coherency such that an application can deny operations like write or delete while it has the file open. This capability is distributed between clustered nodes by SMB, and for native StorNext SNFS when the GlobalShareMode capability is set to TRUE. But this capability is not honored if NFS or native StorNext non-Windows clients perform a conflicting operation, and SMB has not yet been integrated tightly enough with StorNext to handle this "sharemode" capability between SMB and native SNFS.
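For context, the sharemode the paragraph above refers to is the dwShareMode argument to the Win32 CreateFile call. The sketch below is a minimal, hedged example of opening a file while denying other processes write and delete access for the lifetime of the handle; the path and access flags are illustrative assumptions.

```c
/* Hedged sketch: opening a file on Windows with a share mode that denies
 * other processes write and delete access while this handle stays open.
 * The path is hypothetical. */
#include <windows.h>

HANDLE open_deny_write(const wchar_t *path)
{
    return CreateFileW(
        path,
        GENERIC_READ,
        FILE_SHARE_READ,        /* others may read; write/delete are denied */
        NULL,
        OPEN_EXISTING,
        FILE_ATTRIBUTE_NORMAL,
        NULL);
}
```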
Windows Mandatory BRLs continue to operate as they did before this release. If two Windows clients both access a file over SMB, these locks work. If one client uses native StorNext and the other uses SMB, the locks will not work correctly. However, if the SMB client connects to a Windows Re-share system instead of the StorNext NAS presentation, the locks will work.
Mandatory Lock/Sharemode

Supported AMs | NSMC | WNAC | WRC
---|---|---|---
NAS SMB Client (NSMC) | Yes | No | No
Windows Native SAN/LAN (WNAC) | No | Yes | Yes
Windows “Re-share” Client (WRC) | No | Yes | Yes
See also Multi-protocol File Locking.