Callbacks
The cornerstones of the communications between the FSM and the client are callbacks and tokens. A callback is an unsolicited message from the FSM to the client requesting that the client adjust its real-time I/O parameters. The callback contains a token that specifies the amount of non-real-time I/O available on a stripe group.
Initially, all stripe groups in a file system are in non-real-time (ungated) mode. When the FSM receives the initial request for real-time I/O, it first issues callbacks to all clients informing them that the stripe group is now in real-time mode. The token accompanying the message specifies no I/O is available for non-real-time I/O. Clients must now obtain a non-real-time token before they can do any non-real-time I/O.
After sending out all callbacks, the FSM sets a timer based on the RtTokenTimeout value, which by default is set to 1.5 seconds. If all clients respond to the callbacks within the timeout value, the RTIO request succeeds and a response is sent to the requesting client.
In Figure 1, a process on Client A requests some amount of RTIO in Step 1. Since this is the first request, the FSM issues callbacks to all connected clients (Steps 2-5) informing them that the stripe group is now in real-time mode. The clients respond to the FSM in Steps 6-9. After all the clients have responded, the FSM responds to the original requesting client in Step 10.
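The first-request sequence can be sketched as a small simulation. This is only an illustrative model, not the FSM's implementation: the `Client` class, its `response_delay` field, and the function name are hypothetical, and response times are supplied as plain numbers rather than measured.

```python
from dataclasses import dataclass

RT_TOKEN_TIMEOUT = 1.5  # seconds; the default RtTokenTimeout given in the text


@dataclass
class Client:
    """Hypothetical stand-in for a connected client."""
    name: str
    ip: str
    response_delay: float  # seconds the client takes to acknowledge a callback


def first_rtio_request(clients, timeout=RT_TOKEN_TIMEOUT):
    """Model the FSM handling the first RTIO request on a stripe group.

    Returns (succeeded, ip_of_first_non_responder_or_None).
    """
    # The callback goes to every connected client and carries a token of 0:
    # no non-real-time I/O is available until a client asks for a token.
    for client in clients:
        if client.response_delay > timeout:
            # The first client that missed the deadline is reported back
            # to the requester so the failure can be logged.
            return False, client.ip
    # Every client acknowledged in time, so the RTIO request succeeds.
    return True, None
```

With four clients that all answer within the timeout the call returns `(True, None)`; if one of them takes longer than `RT_TOKEN_TIMEOUT`, its IP address is returned alongside `False`.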
If the timer expires and one or more clients have not responded, the FSM must retract the callbacks. It issues a response to the requesting client with the IP address of the first client that did not respond to the callback. This allows the requesting client to log the error with the IP address so that system administrators have a chance of diagnosing the failure. The FSM then sends out callbacks to all the clients to which it first sent the callbacks, retracting them to the original state. In our example, it would set the stripe group back to non-real-time mode.
After sending out the callbacks, the FSM waits for a response using the RtTokenTimeout value as before. If a client again does not respond within the timeout value, the callbacks are retracted and sent out again. This repeats until all clients respond. During this time of token retractions, real-time requests cannot be honored and will only be enqueued.
The FSM must handle a case where a client does not respond to a callback within the specified timeout period (RtTokenTimeout). If a client does not respond to a callback, the FSM must assume the worst: that it is a rogue that could wreak havoc on real-time I/O. It must retract the tokens it just issued and return to the previous state.
As mentioned earlier, the original requestor will receive an error (EREMOTE) and the IP address of the first client that did not respond to the callback. The FSM enters the token retraction state, and will not honor any real-time or token requests until it has received positive acknowledgement from all clients to which it originally sent the callbacks.
In Figure 2, Client A requests some amount of RTIO as in Figure 1. However, assume that Client C did not respond to the initial callback in time (step 7). The FSM will return a failure to Client A for the initial RTIO request, then send out callbacks to all clients indicating the stripe group is no longer real-time (steps 11-14). In the example, Client C responds to the second callback, so the FSM will not send out any more callbacks. The stripe group is back in non-real-time mode.
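The retraction behavior can be sketched as a loop that repeats until one round of callbacks is acknowledged by every client. The model below is an assumption made for illustration: each client is described only by the round from which it starts responding in time, and `max_rounds` is a safety cap added for the example (the text describes the FSM as repeating until all clients respond).

```python
def retraction_rounds(first_good_round, max_rounds=10):
    """Return how many retraction rounds are sent before all clients respond.

    first_good_round maps a client name to the 1-based round from which that
    client starts acknowledging callbacks within the timeout (hypothetical).
    """
    for round_no in range(1, max_rounds + 1):
        # Retraction callbacks go to every client that received the
        # original callback, not only to the one that failed to respond.
        late = [name for name, good in first_good_round.items()
                if round_no < good]
        if not late:
            # Everyone acknowledged: the stripe group is back in its
            # previous state and real-time requests can be honored again.
            return round_no
        # Otherwise the FSM stays in the token retraction state and any
        # incoming real-time requests are only enqueued.
    raise RuntimeError("clients still unresponsive after max_rounds")
```

For the Figure 2 scenario, where Client C answers the first retraction callback, `retraction_rounds({"A": 1, "B": 1, "C": 1, "D": 1})` returns 1.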
Note that this can have interesting repercussions for file systems that are soft mounted by default (such as Windows). The caller may time out because other clients are not responding, give up, and return an error to the application. If the FSM is later able to process the RTIO request, the stripe group may be put into real-time mode after the original caller has already received an error code. Both the FSM and clients log their actions extensively to syslog, so this situation can be detected if it arises.
In Figure 2, if the stripe group were already in real-time mode the FSM would only send out callbacks to those clients that already have tokens. Once all clients responded to the token callbacks, the stripe group would be back in its original state.
A token grants a client some amount of non-real-time I/O for a stripe group. Tokens are encapsulated in callback messages from the FSM. Initially, no tokens are required to perform I/O. Once a stripe group is put into real-time mode, the FSM sends callbacks to all clients informing them that they will need a token to perform any non-real-time I/O. The first I/O after receiving the callback will then request a non-real-time I/O token from the FSM.
The FSM calculates the amount of non-real-time bandwidth using the following formula:
In the above calculation, rvio_current is the total bandwidth reserved by current RVIO clients. The amount of existing real-time I/O (rtio_current) has already been adjusted with the reserve parameter. As each client requests a non-real-time I/O token, the number of clients increases (current_num_nonrtio_clients in the above formula) and the amount of available non-real-time I/O decreases.
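As a rough illustration of how these quantities interact, the sketch below subtracts the real-time and RVIO reservations from a total capacity and splits the remainder evenly among token holders. The name `rtio_limit`, the even split, and the clamp at zero are assumptions made for this example; they are not taken from the FSM's actual formula.

```python
def available_nonrt_io(rtio_limit, rtio_current, rvio_current,
                       current_num_nonrtio_clients):
    """Estimate the non-real-time I/O each token holder may use.

    rtio_limit is a hypothetical total capacity for the stripe group;
    current_num_nonrtio_clients is assumed to already include the client
    that is requesting a token. All amounts share the same unit.
    """
    if current_num_nonrtio_clients < 1:
        raise ValueError("at least one token holder is required")
    # Capacity left after real-time and RVIO reservations (never negative).
    remaining = max(rtio_limit - rtio_current - rvio_current, 0)
    # Assumed even split: more token holders means a smaller share each.
    return remaining / current_num_nonrtio_clients
```

For example, with a capacity of 1000, 400 reserved for real-time I/O and 100 for RVIO, two token holders would each get 250 and four would each get 125.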
Each time there is a change in the amount of non-real-time I/O available, the FSM sends callbacks to the clients with tokens. It is important to note that unlike the initial set of callbacks where the FSM sent callbacks to all connected clients, it is now only necessary to send callbacks to those clients that have an existing token.
Once a client has a token, it can perform as much I/O per second as is allowed by that token. It does not need to contact the FSM on every I/O request. The FSM will inform the client whenever the token changes value.
In Figure 3, assume the stripe group is already in real-time mode as a result of an RTIO request from client A. Clients B and D are doing non-real-time I/O to the stripe group and have a token that specifies the amount of non-real-time I/O available. Client C then requests a non-real-time I/O token in Step 1. The FSM calls back to Clients B and D and specifies the new token amount in Steps 2-3. The clients respond in Steps 4-5, acknowledging the new token amount. The FSM then responds to Client C with the new token in Step 6.
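The Figure 3 exchange can be sketched as follows. The class and its `nonrt_pool` attribute are hypothetical, and the even split among holders is an assumption; the point is only the ordering: existing token holders are called back with the new amount before the requester receives its token.

```python
class StripeGroupTokens:
    """Hypothetical model of non-real-time token bookkeeping on the FSM."""

    def __init__(self, nonrt_pool):
        self.nonrt_pool = nonrt_pool  # assumed total non-real-time I/O
        self.holders = []             # clients that currently hold a token

    def request_token(self, client):
        """Return (callbacks, token) for a new token request.

        callbacks lists (holder, new_amount) pairs for clients that already
        hold a token; clients without a token are not contacted.
        """
        new_amount = self.nonrt_pool / (len(self.holders) + 1)
        callbacks = [(holder, new_amount) for holder in self.holders]
        # In the real protocol the FSM waits for the holders to acknowledge
        # before answering the requester; that wait is not modeled here.
        self.holders.append(client)
        return callbacks, new_amount
```

With a pool of 600 and Clients B and D already holding tokens, a request from Client C yields callbacks to B and D for 200 each, and Client C receives a token for 200.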
There are two major failures that affect QOS: FSM crashes and client crashes. Either can also take the form of a loss of communication (a network outage). For client and server failures, the system attempts to readjust itself to the pre-failure state without any manual intervention.
If the FSM crashes or is stopped, there is no immediate effect on real-time (ungated) I/O. As long as the I/O does not need to contact the FSM for some reason (attribute update, extent request, etc.), the I/O will continue. From the standpoint of QOS, the FSM being unavailable has no effect.
Non-real-time I/O will be pended until the FSM is re-connected. The rationale for this is that since the stripe group is in real-time mode, there is no way to know if the parameters have changed while the FSM is disconnected. The conservative design approach was taken to hold off all non-real-time I/O until the FSM is reconnected.
Once the client reconnects to the FSM, the client must re-request any real-time I/O it had previously requested. The FSM does not keep track of QOS parameters across crashes; that is, the information is not logged and is not persistent. Therefore, it is up to the clients to inform the FSM of the amount of required RTIO and to put the FSM back into the same state as it was before the failure.
In most cases, this results in the amount of real-time and non-real-time I/O being exactly the same as it was before the crash. The only time this would be different is if the stripe group is oversubscribed. In this case, since more RTIO had been requested than was actually available, and the FSM had adjusted the request amounts, it is not deterministically possible to re-create the picture exactly as it was before. Therefore, if a deterministic picture is required across reboots, it is advisable not to oversubscribe the amount of real-time I/O.
The process of each client re-requesting RTIO is exactly the same as it was initially; once each client has reestablished its RTIO parameters, the non-real-time I/O is allowed to proceed to request a non-real-time token. It may take several seconds for the SAN to settle back to its previous state. It may be necessary to adjust the RtTokenTimeout parameter on the FSM to account for clients that are slow in reconnecting to the FSM.
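The recovery responsibility described above can be sketched with two small classes. Both classes and their method names are hypothetical; the sketch only shows that the FSM's QOS state lives in memory, so each client has to replay the RTIO request it remembers after a reconnect.

```python
class Fsm:
    """Hypothetical FSM model: QOS state is in memory and not persisted."""

    def __init__(self):
        self.rtio = {}  # client name -> requested RTIO amount

    def request_rtio(self, client_name, amount):
        self.rtio[client_name] = amount


class RtioClient:
    """Hypothetical client that remembers its own RTIO request."""

    def __init__(self, name, rtio_amount=None):
        self.name = name
        self.rtio_amount = rtio_amount  # None if no RTIO was requested

    def on_reconnect(self, fsm):
        # The FSM has no record of earlier requests after a restart,
        # so the client re-requests exactly as it did initially.
        if self.rtio_amount is not None:
            fsm.request_rtio(self.name, self.rtio_amount)


def restart_and_recover(clients):
    """Simulate an FSM restart followed by every client reconnecting."""
    fsm = Fsm()  # fresh instance: all previous QOS state is gone
    for client in clients:
        client.on_reconnect(fsm)
    return fsm
```

After `restart_and_recover`, the new FSM instance holds the same RTIO amounts as before, but only for clients that actually reconnected and replayed their requests.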
When a client disconnects, either abruptly (via a crash or a network partition) or in a controlled manner (via an unmount), the FSM releases the client's resources back to the SAN. If the client had real-time I/O on the stripe group, that amount of real-time I/O is released back to the system. This causes a series of callbacks to the clients (all clients if the stripe group is transitioning from real-time to non-real-time), informing them of the new amount of non-real-time I/O available.
If the client had a non-real-time I/O token, the token is released and the amount of non-real-time I/O available is recalculated. Callbacks are sent to all clients that have tokens informing them of the new amount of non-real-time I/O available.
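Which clients receive callbacks after a disconnect can be sketched as a single function. The function name and its arguments are hypothetical, and RVIO reservations are ignored for simplicity.

```python
def callbacks_on_disconnect(client, rtio, token_holders, connected):
    """Return the set of clients to call back after `client` disconnects.

    rtio maps client name -> real-time I/O amount held on the stripe group;
    token_holders and connected are collections of client names.
    """
    if client not in rtio and client not in token_holders:
        # The client held no QOS resources, so nothing changes.
        return set()
    remaining_rtio = {c: amount for c, amount in rtio.items() if c != client}
    if rtio and not remaining_rtio:
        # The last real-time reservation is gone: the stripe group returns
        # to non-real-time mode and every remaining client is told.
        return set(connected) - {client}
    # Otherwise only clients holding a token need the new amount.
    return set(token_holders) - {client}
```

For example, if Client A holds the only real-time reservation and disconnects, all of B, C, and D are called back; if token holder B disconnects instead, only the remaining token holder D is called back.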
While it is not a failure case, the handling of a client token release is exactly the same as in the case where the client disconnected. All clients retain non-real-time tokens for a fixed amount of time. The default is 60 seconds. This can be controlled via the nrtiotokentimeout mount option. After the specified period of inactivity (i.e., no non-real-time I/O on the stripe group), the client will release the token back to the FSM. The FSM will re-calculate the amount of non-real-time bandwidth available, and send out callbacks to other clients.
Therefore, if a situation exists where a periodic I/O operation occurs every 70 seconds, it would be beneficial to set the nrtiotokentimeout mount option to something greater than or equal to 70 seconds to cut down on system and SAN overhead.
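The effect of the hold time on a periodic workload can be sketched by counting how often a token would be released between I/Os. The function is hypothetical, and it assumes that a release happens only when the idle gap strictly exceeds the hold time; the exact boundary behavior is not specified in the text.

```python
def token_releases(io_times, hold_seconds=60):
    """Count token releases caused by idle gaps between consecutive I/Os.

    io_times is an ascending list of timestamps (seconds) at which the
    client performs non-real-time I/O; hold_seconds is the token hold time.
    """
    return sum(
        1
        for previous, following in zip(io_times, io_times[1:])
        if following - previous > hold_seconds  # idle longer than the hold time
    )
```

With I/O every 70 seconds, for instance `[0, 70, 140, 210]`, the default of 60 seconds gives three releases (each followed by a new token request and a round of callbacks), while a hold time of 90 seconds gives none.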