Replication: Using the Trigger-Based Replication Queue to Determine If File/Cartridge Based Replication Is Working

Overview

In this document we review how to view the trigger-based replication (TBR) queue to determine whether trigger-based replication is working correctly. As part of the discussion, we cover what you should expect to see in both the source and target system logs. You can check whether TBR is working correctly either on a live system or from collect logs gathered from the systems. We will review the commands used on a live system to view the existing trigger queue, and discuss what to search for in the collect.txt.

 

Note: What we are calling trigger-based replication is referred to as File/Cartridge Based Replication in the DXi user interface and in Quantum's documentation and training.

 

Logs to view:

/hurricane/tsunami.log on the source system
/hurricane/tsunami.log on the target system
collect.txt from a collect log

 


Identifying the Trigger Queue

The first thing to verify in the tsunami log is whether the TBR-enabled shares are receiving new trigger requests. If a share has TBR enabled, then upon ingest you should expect an entry in the source system's /hurricane/tsunami.log showing that a request has been received into the trigger queue:

 

tsunami.log.1:INFO   - 09/04/13-01:41:37 - TriggerMessenger TriggerMessenger.cpp(481) [replicationd] Trigger_Handler() - Got Trigger Message for sharition: ESITE, Id: ESITE, path: /snfs/ddup/shares/ESITE/esite09/full/eServiceDotNetBuckinghamPSBW/eServiceDotNetBuckinghamPSBW_backup_20130904004127.bak, isDelete: false

 

The request will sit in the queue and will be replicated to the target in FIFO (First In, First Out) order. This applies to all TBR-enabled shares: there is only one queue, and it is worked serially. (In firmware 2.3, trigger replication will be able to work multiple requests in parallel; as of 2.2.1.3 and below, it works on one request in the queue at a time. 2.3 will still operate FIFO.)
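The share name in a "Got Trigger Message" entry can be pulled out with a short pipeline, which makes it easy to count incoming trigger requests per share. This is a minimal sketch against the sample entry quoted above; the sed pattern is an assumption based on the log format shown in this article.

```shell
# Sample "Got Trigger Message" entry from the source tsunami.log (quoted above)
line='INFO   - 09/04/13-01:41:37 - TriggerMessenger TriggerMessenger.cpp(481) [replicationd] Trigger_Handler() - Got Trigger Message for sharition: ESITE,'

# Extract the share (sharition) name that follows "for sharition:"
echo "$line" | sed 's/.*for sharition: \([^,]*\).*/\1/'
# -> ESITE
```

On a live source system, the same idea gives per-share ingest counts, e.g. `grep 'Got Trigger Message' /hurricane/tsunami.log* | sed 's/.*for sharition: \([^,]*\).*/\1/' | sort | uniq -c`.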

 

Once a request in the queue is up for replication, you should expect to see entries like the ones shown below in the source system's /hurricane/tsunami.log.

 

Notice the TriggerId or TID. You can use this to verify if this request is still in the queue or has already been processed, by using the ‘redb_util’ command on the live box, or by searching the TID in the collect.txt of a collect log.

 

Also, note that this trigger request wasn't replicated out of the queue until about 32 hours later. This indicates a TBR backlog: the queue can't keep up with the amount of data being ingested and queued for replication. TBR backlogs, and how to troubleshoot them, will be discussed in another article.
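The lag can be quantified by converting the two tsunami.log timestamps (enqueue at 09/04/13-01:41:37, replication at 09/05/13-09:46:58) to epoch seconds and subtracting. A minimal sketch, assuming GNU date:

```shell
# Timestamps taken from the log entries in this article (MM/DD/YY-HH:MM:SS)
enqueued=$(date -u -d '09/04/2013 01:41:37' +%s)    # "Got Trigger Message"
replicated=$(date -u -d '09/05/2013 09:46:58' +%s)  # replicateObject()

echo "$(( (replicated - enqueued) / 3600 )) hours in the queue"
# -> 32 hours in the queue
```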

 

INFO   - 09/05/13-09:46:58 - Triggerd ReplicationAPI.cpp(7008) [replicationd] replicateObject() - TriggerId 2214065, TriggerName ESITE, nodeType 0, SharitionName ESITE, SharitionPath /snfs/ddup/shares/ESITE/esite09/full/eServiceDotNetBuckinghamPSBW/eServiceDotNetBuckinghamPSBW_backup_20130904004127.bak, sourceIPAddress 172.22.240.48, destIPAddress 172.22.240.49

INFO   - 09/05/13-09:46:58 - Triggerd ReplicationAPI.cpp(7011) [replicationd] replicateObject() - [TID 2214065] request to replicate /snfs/ddup/shares/ESITE/esite09/full/eServiceDotNetBuckinghamPSBW/eServiceDotNetBuckinghamPSBW_backup_20130904004127.bak to 172.22.240.49

INFO   - 09/05/13-09:46:58 - Triggerd ReplicationAPI.cpp(7091) [replicationd] replicateObject() - [TID 2214065] Trigger path: esite09/full/eServiceDotNetBuckinghamPSBW/eServiceDotNetBuckinghamPSBW_backup_20130904004127.bak

INFO   - 09/05/13-09:46:58 - Triggerd ReplicationAPI.cpp(7220) [replicationd] replicateObject() - [TID 2214065] held AttrBall at directory /snfs/tmp/trigger/replication/ESITE-2214065 using holdTagRefsForAttrBall.

INFO   - 09/05/13-09:46:58 - Triggerd ReplicationAPI.cpp(7292) [replicationd] replicateObject() - [TID 2214065] successfully created AttrBall V3 bundle

INFO   - 09/05/13-09:46:58 - Triggerd ReplicationAPI.cpp(7336) [replicationd] replicateObject() - [TID 2214065] held AttrBall at directory /snfs/tmp/trigger/replication/ESITE-2214065/metatardir using holdTagRefsForAttrBall.

INFO   - 09/05/13-09:46:59 - replicationd ReplicationUtil.cpp(139) [replicationd] sendUndedupFileToTarget() - SendUndedupFileToTarget(): /snfs/tmp/trigger/replication/ESITE-2214065/tagfile to /snfs/tmp/trigger/replication/ESITE-2214065/continuous and statusInfo

INFO   - 09/05/13-09:46:59 - Triggerd ReplicationAPI.cpp(7727) [replicationd] replicateObject() - [TID 2214065] successfully replicated /snfs/ddup/shares/ESITE/esite09/full/eServiceDotNetBuckinghamPSBW/eServiceDotNetBuckinghamPSBW_backup_20130904004127.bak to target 172.22.240.49

INFO   - 09/05/13-09:46:59 - replicationd ReplicationUtil.cpp(139) [replicationd] sendUndedupFileToTarget() - SendUndedupFileToTarget(): /snfs/tmp/trigger/replication/ESITE-2214065/bundle.tar to /snfs/tmp/trigger/replication/ESITE-2214065/bundle.tar and statusInfo CA96B9871E36D08388C350B0129A5870

INFO   - 09/05/13-09:46:59 - Triggerd ReplicationAPI.cpp(7996) [replicationd] replicateObject() - [TID 2214065] sending unpack request to target: 172.22.240.49: details: trigger name ESITE, share/parition ESITE, path /snfs/ddup/shares/ESITE/esite09/full/eServiceDotNetBuckinghamPSBW/eServiceDotNetBuckinghamPSBW_backup_20130904004127.bak, bundle tag CA96B9871E36D08388C350B0129A5870

INFO   - 09/05/13-09:46:59 - replicationd ReplicationAPI.cpp(6924) [replicationd] cleanupReplicateObject() - dropTagRefsForAttrBall() of /snfs/tmp/trigger/replication/ESITE-2214065/metatardir is dropped

INFO   - 09/05/13-09:46:59 - replicationd ReplicationAPI.cpp(6948) [replicationd] cleanupReplicateObject() - dropTagRefsForAttrBall() of /snfs/tmp/trigger/replication/ESITE-2214065 is dropped

 

On the target system, /hurricane/tsunami.log will show the following. Use the TID to follow the request on the target as well:

 

INFO   - 09/05/13-09:46:59 - re_message ReplicationUtil.cpp(473) [re_message] recvReplicateFileFromSource() - Received prepost file: /snfs/tmp/trigger/replication/ESITE-2214065/continuous. Sending vpmessage to GC

INFO   - 09/05/13-09:46:59 - re_message ReplicationAPI.cpp(13206) [re_message] recv_triggerReplicationComplete() - Sending Trigger Recovery message: triggerID= 2214065

WARN   - 09/05/13-09:46:59 - aud ReplicationAPI.cpp(5054) [aud] oper_triggerRecoverObject() - [TID 2214065] sharitionName: ESITE, repObjectPath: /snfs/ddup/shares/ESITE/esite09/full/eServiceDotNetBuckinghamPSBW/eServiceDotNetBuckinghamPSBW_backup_20130904004127.bak, bundle: /snfs/tmp/trigger/replication/ESITE-2214065/bundle.tar, tgtRecoverDir: /snfs/ddup/shares/ESITE/esite09/full/eServiceDotNetBuckinghamPSBW, relativeRecoverDir: esite09/full/eServiceDotNetBuckinghamPSBW

INFO   - 09/05/13-09:46:59 - aud ReplicationAPI.cpp(13319) [aud] remove_triggerContTags() - dropPrepostedTagRefs(): continuousFilename [/snfs/tmp/trigger/replication/ESITE-2214065/continuous] is done and removed

 

 


Useful DXi Commands

On a live system, you can check the existing trigger queue by running /hurricane/redb_util -t td -l, which prints every request in the queue to the screen. Note that if there are thousands of requests, it is best to redirect the output to a file; you can then run the output through wc -l to get a total count. Each queued request looks like this:

 

trigger_id=2228255,sharition_name=ENWISEN,sharition_type=share,sharition_identifier=ENWISEN,node_id=,node_num=0,source_ip=172.22.240.48, target_ip=172.22.240.49,start_time=1378319442,end_time=0,error_string=,status=Queued,barcode=,path=/snfs/ddup/shares/ENWISEN/enwisen01/diff/model/model_backup_20130904143032.bak

 

 

Another command, which I've found to be the quickest way to see what is in the queue, is to query the Linter database:

 

 Example:

 

[root@RED-DXi0v221 DXi]# echo "select count(*) from TDREQUESTQUEUE;"|inl -u repuser/
INteractive Language v.5.9 for RDBMS Linter SQL v.5.9
Copyright (C) 2000-2004 RelexUS, Inc. All rights reserved.
Copyright (C) 1995-2003 Relex, Inc. All rights reserved.

 INL : start time : 12:34:50 end time :  12:34:50


|          0|  <<< This number shows the count of requests in the queue
 INL : number of rows displayed: 1


 

If you have a collect log, you can also check the TBR queue by looking in the collect.txt. Starting in 2.2.x, the collect.txt includes the output of the following commands:

 

###09:47:39### -Trigger based replication- '/hurricane/redb_util -t aud -L':

###09:47:40### -Trigger based replication- '/hurricane/redb_util -t aud -l':

###09:47:40### -Trigger based replication- '/hurricane/redb_util -t sync_src -l':

###09:47:40### -Trigger based replication- '/hurricane/redb_util -t sync_src -L':

###09:47:40### -Trigger based replication- '/hurricane/redb_util -t td -L':

###09:48:52### -Trigger based replication- '/hurricane/redb_util -t td -l':

 

You can use the command 'grep status=Queued /scratch/collect/node1-collection/collect.txt | wc -l' to get a count of how many trigger requests were in the queue at the time the collect was taken.

 

trigger_id=2228255,sharition_name=ENWISEN,sharition_type=share,sharition_identifier=ENWISEN,node_id=,node_num=0,source_ip=172.22.240.48,target_ip=172.22.240.49,start_time=1378319442,end_time=0,

error_string=,status=Queued,barcode=,path=/snfs/ddup/shares/ENWISEN/enwisen01/diff/model/model_backup_20130904143032.bak
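The start_time field in a queue entry is a Unix epoch timestamp, so it shows when the request was queued. A minimal sketch, assuming GNU date; the value is the start_time from the entry above:

```shell
# start_time=1378319442 from the queue entry above, rendered as UTC
date -u -d @1378319442 '+%Y-%m-%d %H:%M:%S'
# -> 2013-09-04 18:30:42
```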

 

From the tsunami.log, you can get the TriggerId (TID) and then grep for the TID in the collect.txt to see where a specific request is in the replication process.

 

path=/snfs/ddup/shares/ESITE/esite09/full/eServiceDotNetBuckinghamPSBW/eServiceDotNetBuckinghamPSBW_backup_20130904004127.bak
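Extracting the TID can be scripted as well. This sketch parses the sample replicateObject() entry quoted earlier; the sed pattern is an assumption based on that log format, and the collect.txt path is the one used in this article.

```shell
# Sample replicateObject() entry from the source tsunami.log (quoted above)
line='INFO   - 09/05/13-09:46:58 - Triggerd ReplicationAPI.cpp(7008) [replicationd] replicateObject() - TriggerId 2214065, TriggerName ESITE,'

# Pull out the numeric TriggerId (TID)
tid=$(echo "$line" | sed 's/.*TriggerId \([0-9]*\).*/\1/')
echo "$tid"
# -> 2214065

# Then, against a collect log:
#   grep "TID $tid" /scratch/collect/node1-collection/collect.txt
```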

 

If you want to see how many requests each share has in the queue, you just need to grep for the sharition_name:

 

 

The queue entry shown earlier is for a NAS share called “ENWISEN”. If you wanted to see how many of its requests were in the queue, you would grep the collect.txt for sharition_name=ENWISEN and status=Queued:

 

grep 'sharition_name=ENWISEN' /scratch/collect/node1-collection/collect.txt | grep status=Queued

 

If you're on a live system, you can do the following:

 

/hurricane/redb_util -t td -l | grep sharition_name=ENWISEN

 

You can do this for every share/partition on which TBR is enabled.
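Rather than running one grep per share, you can get a count for every share in a single pass. A minimal sketch over a saved queue dump with hypothetical sample entries (the field layout is taken from the queue entries shown earlier); on a live system, create the dump with /hurricane/redb_util -t td -l:

```shell
# Hypothetical queue dump using the field layout shown earlier in this article
cat > /tmp/trigger_queue.txt <<'EOF'
trigger_id=1,sharition_name=ENWISEN,sharition_type=share,status=Queued,path=/snfs/ddup/shares/ENWISEN/a.bak
trigger_id=2,sharition_name=ENWISEN,sharition_type=share,status=Queued,path=/snfs/ddup/shares/ENWISEN/b.bak
trigger_id=3,sharition_name=ESITE,sharition_type=share,status=Queued,path=/snfs/ddup/shares/ESITE/c.bak
EOF

# One queued-request count per share
grep 'status=Queued' /tmp/trigger_queue.txt \
  | sed 's/.*sharition_name=\([^,]*\).*/\1/' | sort | uniq -c
```

This prints one line per share with its queued-request count (here, 2 for ENWISEN and 1 for ESITE).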

 


Additional Resources

See the following TechQi doc by Fred Rybczynski on File and Cartridge-Based Replication. This link takes you to the comments at the end of the article.

http://ppowportalv1.quantum.com/TechnicalQi/index.php/2013/01/23/vtl-cartridge-size-recommendation-for-fcr/#comments

 

Future Qwikipedia articles will cover troubleshooting TBR backlogs and trigger replication, with SR and PTR examples.

 



This page was generated by the BrainKeeper Enterprise Wiki, © 2018