Locating Replication Issues

Overview

If the customer is having replication issues, do the following:

 

  1. Make sure that replication is set up correctly. This can be verified in the replication.conf section of the collect.txt log on both the source and target DXi systems:

### -Replication- 'cat /data/hurricane/replication.conf':
[Global_Global]
CvfsMountPoint=/snfs/Q/
DedupWindowDuration=960
DedupWindowScheduledTimesForSystem=05:00
EncryptionEnabled=false
IsReplicationForSystemEnabled=false
ProgramReplicationIsPaused=false
QbfsMountPoint=/Q/
ReplicationScheduledTimesForSystem=
SourceHostList=10.212.17.27,10.254.246.27
UserReplicationIsPaused=false

 

[Share_1]
DedupAge=60
DedupEnabled=true
DedupWindowDuration=0
DedupWindowScheduledTimes=
NodeId=
NodeName=AB10_NASTEST
ReplicationDestinations=10.212.17.27
ReplicationEnabled=false
ReplicationRole=3
ReplicationScheduledTimes=
RetentionAge=120
ShareType=0
TriggerReplication=0
TriggerReplicationId=

 

[Share_2]
DedupAge=60
DedupEnabled=true
DedupWindowDuration=0
DedupWindowScheduledTimes=
NodeId=
NodeName=AB10_NAS002
ReplicationDestinations=10.212.17.27
ReplicationEnabled=true
ReplicationRole=3
ReplicationScheduledTimes=17:00
RetentionAge=120
ShareType=0
TriggerReplication=0
TriggerReplicationId=
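
To make a listing like the one above quicker to read, the following sketch prints one line per share with its replication settings. It assumes the INI-style layout shown above. The function name list_share_replication is hypothetical and is not a DXi utility.

```shell
#!/bin/bash
# Hypothetical helper (not a DXi utility): summarize per-share replication
# settings from a replication.conf file in the INI-style layout shown above.
list_share_replication() {
    local conf="$1"
    awk -F= '
        # Print the settings collected for the previous [Share_N] section,
        # then reset them for the next section.
        function flush() {
            if (section ~ /^Share_/)
                printf "%s name=%s enabled=%s dest=%s schedule=%s\n",
                       section, name, enabled, dest, (sched == "" ? "none" : sched)
            name = enabled = dest = sched = ""
        }
        # A new section header, e.g. [Share_1] -> Share_1
        /^\[/ { flush(); section = $0; gsub(/^\[|\]$/, "", section); next }
        $1 == "NodeName"                  { name = $2 }
        $1 == "ReplicationEnabled"        { enabled = $2 }
        $1 == "ReplicationDestinations"   { dest = $2 }
        $1 == "ReplicationScheduledTimes" { sched = $2 }
        END { flush() }
    ' "$conf"
}
```

For example, run list_share_replication /data/hurricane/replication.conf on both the source and the target. A share with enabled=false or schedule=none is a candidate for closer inspection.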

 

  2. Make sure that the source and target are communicating.

Use the ping command to check connectivity from source to target, and vice versa, from target to source.

Example: ping 10.212.17.27 (target IP address). Do the same from target to source.

 

Use the telnet command to connect from source to target on port 1062 and on port 80.

Example: telnet <target ip> 1062

Example: telnet <target ip> 80
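
If an interactive telnet session is inconvenient, the same port check can be scripted. The sketch below is one way to do it, not a DXi tool. It assumes that bash was built with /dev/tcp support and that the coreutils timeout command is installed. The function name check_port is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: test whether a TCP port accepts connections, as a
# non-interactive alternative to telnet. Relies on bash's /dev/tcp
# redirection and the coreutils `timeout` command (both assumed present).
check_port() {
    local host="$1" port="$2"
    # Give up after 5 seconds so an unreachable host does not hang the check.
    if timeout 5 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
        echo "OK: $host port $port is reachable"
    else
        echo "FAIL: $host port $port is not reachable"
        return 1
    fi
}

# Example usage (replace the address with the target IP address):
# check_port 10.212.17.27 1062
# check_port 10.212.17.27 80
```

Run the check from the source against the target, then from the target against the source, as with ping.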
 

  3. Check the tsunami.log file and look for error messages.

Examples of Replication Failures

 

Example 1

 

ERROR  - 03/30/10-22:02:15 - replicationd DestinationThread.cpp(918) [replicationd] continuousReplication() - {1082132832} Continuous data replication activity has failed  to target host: 10.212.17.27
Error details: Network error

 

Example 2

 

ERROR  - 04/10/10-15:51:03 - replicationd DestinationThread.cpp(1749) [replicationd] namespaceReplication() - {1082132832} QLOG_REP_ERROR - Namespace Replication FAILED for task: destinationId=10.212.17.27, launchTime=Sat Apr 10 15:00:00 2010(1270933200), nodeName=AB10_NAS003, nodeId=, nodeType=Share(0), taskMode=Scheduled(3), nodeDirectory: /Q/shares/AB10_NAS003, eventHostId: AB10-DXI004.Celero.ca, eventBundleUID: 0, eventBundleUID: 130, synchronizeId:  DETAILS:
WARN   - 04/10/10-15:51:03 - replicationd DestinationThread.cpp(1776) [replicationd] namespaceReplication() - {1082132832} Finalizing Source namespace bundle for FAILED Namespace Replication name: AB10_NAS003 destination: 10.212.17.27 error:
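
Failure entries like the two examples above can be pulled out of the log with a filter on the level and process name. The following is a minimal sketch that assumes every entry starts with the level, followed by the timestamp and the process name, as in the examples. The function name replication_failures is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print replicationd ERROR and WARN entries from a
# tsunami.log-style file. Assumes the entry layout shown in the examples,
# e.g. "ERROR  - 03/30/10-22:02:15 - replicationd DestinationThread.cpp(918) ...".
replication_failures() {
    local log="$1"
    grep -E '^(ERROR|WARN) +- [^ ]+ - replicationd ' "$log"
}

# Example usage:
# replication_failures /hurricane/tsunami.log
```

Continuation lines such as "Error details: Network error" do not start with a level and are not matched. Adding the -A 1 option to grep also prints the line that follows each match.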


Useful Commands When Looking for Replication Issues

The following section provides some commands that can be used to look for replication issues directly from the DXi system:

 

From the system

 

Use ssh to log in to the DXi system, then enter the following commands at the command prompt:


less /hurricane/tsunami.log

tail -f /hurricane/tsunami.log

grep replicationd /hurricane/tsunami.log | tail -10

 

EXAMPLE OUTPUT

  

Example 1

 

This example shows the activity recorded in the tsunami log, including replication activity and error messages. The less command is useful if information from one or more previous months is needed:

 

INFO   - 01/12/11-17:16:39 - replicationd REDaemon.cpp(313) [replicationd] initialize() - Signals blocked
INFO   - 01/12/11-17:16:39 - replicationd REDaemon.cpp(528) [replicationd] doInitialize() - rm cmd is : rm -fr /snfs/tmp/replication/namespace/*
INFO   - 01/12/11-17:16:39 - replicationd REDaemon.cpp(534) [replicationd] doInitialize() - rm cmd is : rm -fr /Q/partitions/namespace_*
INFO   - 01/12/11-17:16:39 - replicationd BfstV2.cpp(67) [replicationd] initialize() - Bfst initialize needed - calling bfst2_initialize
INFO   - 01/12/11-17:16:39 - replicationd REDaemon.cpp(623) [replicationd] startAllChildThreads() - Launching Scheduler Thread
INFO   - 01/12/11-17:16:39 - replicationd REDaemon.cpp(627) [replicationd] startAllChildThreads() - Launching Command Thread
INFO   - 01/12/11-17:16:39 - replicationd REDaemon.cpp(1607) [replicationd] handleProgrammaticPause() - Programmatic Pause: setting paused to false
INFO   - 01/12/11-17:16:39 - replicationd ReplicationAPI.cpp(14584) [replicationd] setGlobalReplicationPauseStatus() - System has resumed the replication service

 

Press SHIFT-G to go to the bottom of the file. You can then scroll up and down as needed:

 

INFO   - 03/08/11-13:00:19 - bpgc ReconcileThread.cpp(1327) [bpgc] dumpReplicatedNamespaceTags() - {1082132832} Gathering the tags from all replicated namespaces
INFO   - 03/08/11-13:00:19 - bpgc NamespaceBundleUtil.cpp(262) [bpgc] untar() - Successfully untared namespace bundle: /snfs/tmp/replication-expansion1082132832_1299589219/target/DXi5-02nh.quantum.com/shares/keithR2/1/bundle.tar
INFO   - 03/08/11-13:00:19 - bpgc NamespaceBundleUtil.cpp(262) [bpgc] untar() - Successfully untared namespace bundle: /snfs/tmp/replication-expansion1082132832_1299589219/target/DXi5-02nh.quantum.com/partitions/KP1/1/bundle.tar
INFO   - 03/08/11-13:00:19 - bpgc NamespaceBundleUtil.cpp(262) [bpgc] untar() - Successfully untared namespace bundle: /snfs/tmp/replication-expansion1082132832_1299589219/target/DXi5-02nh.quantum.com/partitions/Keiths/3/bundle.tar
INFO   - 03/08/11-13:00:19 - bpgc NamespaceBundleUtil.cpp(262) [bpgc] untar() - Successfully untared namespace bundle: /snfs/tmp/replication-expansion1082132832_1299589219/target/DXi5-02nh.quantum.com/partitions/Keiths/2/bundle.tar
INFO   - 03/08/11-13:00:19 - bpgc NamespaceBundleUtil.cpp(262) [bpgc] untar() - Successfully untared namespace bundle: /snfs/tmp/replication-expansion1082132832_1299589219/target/DXi5-02nh.quantum.com/partitions/Keiths/1/bundle.tar
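
When only one month of history is needed, filtering on the timestamp can be faster than paging through the whole file. The sketch below assumes the MM/DD/YY-HH:MM:SS timestamp format seen in the log entries above. The function name entries_for_month is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the log entries for one month. Assumes each
# entry starts with the level followed by an MM/DD/YY-HH:MM:SS timestamp,
# e.g. "INFO   - 03/08/11-13:00:19 - bpgc ...".
entries_for_month() {
    local log="$1" month="$2" year="$3"
    grep -E "^[A-Z]+ +- ${month}/[0-9]{2}/${year}-" "$log"
}

# Example usage: all entries from March 2011, paged with less.
# entries_for_month /hurricane/tsunami.log 03 11 | less
```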
 

Example 2

 

This command shows the most recent entries in tsunami.log and continues to display new entries as they are written. It is useful for getting a quick idea of the current status of the system:

 

[root@DXi75-nh1 ~]# tail -f /hurricane/tsunami.log
INFO   - 03/08/11-19:05:09 - hwmon MonitorArrayStatus.cpp(2260) [hwmon] CheckFailureList() - Failure Type List: RecoveryFailureTypeValue::REC_VOLUME_HOT_SPARE_IN_USE.
INFO   - 03/08/11-19:05:09 - hwmon MonitorArrayStatus.cpp(2261) [hwmon] CheckFailureList() - Taking no action.
INFO   - 03/08/11-19:07:15 - hwmon MonitorArrayStatus.cpp(2260) [hwmon] CheckFailureList() - Failure Type List: RecoveryFailureTypeValue::REC_VOLUME_HOT_SPARE_IN_USE.
INFO   - 03/08/11-19:07:15 - hwmon MonitorArrayStatus.cpp(2261) [hwmon] CheckFailureList() - Taking no action.
INFO   - 03/08/11-19:07:15 - hwmon perfmon.cpp(527) [hwmon] PerfServerThread() - PerfServerThread: New connection from client '10.17.21.1' socket 22
INFO   - 03/08/11-19:07:15 - hwmon perfmon.cpp(774) [hwmon] ReadClientConnection() - Got get request from client '10.17.21.1' flags 0x131
INFO   - 03/08/11-19:07:16 - hwmon perfmon.cpp(527) [hwmon] PerfServerThread() - PerfServerThread: New connection from client '10.17.21.1' socket 22
INFO   - 03/08/11-19:07:16 - hwmon perfmon.cpp(774) [hwmon] ReadClientConnection() - Got get request from client '10.17.21.1' flags 0x131
INFO   - 03/08/11-19:09:21 - hwmon MonitorArrayStatus.cpp(2260) [hwmon] CheckFailureList() - Failure Type List: RecoveryFailureTypeValue::REC_VOLUME_HOT_SPARE_IN_USE.
INFO   - 03/08/11-19:09:21 - hwmon MonitorArrayStatus.cpp(2261) [hwmon] CheckFailureList() - Taking no action.
 

Example 3

 

This command narrows down the search by listing only the last 10 lines in tsunami.log that mention replicationd:

 

[root@DXi75-nh1 ~]# grep replicationd /hurricane/tsunami.log |tail -10
INFO   - 03/07/11-03:00:02 - replicationd DestinationThread.cpp(1666) [replicationd] namespaceReplication() - {1082132832} QLOG_REP_INFO - Successful namespace replication completed for: Keiths to target: 10.105.13.77
INFO   - 03/08/11-03:00:00 - replicationd RESchedulerThread.cpp(650) [replicationd] doProcess() - {1166059872} Adding new namespace replication task [destinationId=10.105.13.77, launchTime=Tue Mar  8 03:00:00 2011(1299553200), nodeName=Keiths, nodeId=VL06CX0743BVA00005, nodeType=Partition(1), taskMode=Scheduled(3), nodeDirectory: /Q/partitions/VL06CX0743BVA00005, eventHostId: DXi75-nh1.labs.northampton.uk, eventBundleUID: 0, eventBundleUID: 21, synchronizeId: ]
INFO   - 03/08/11-03:00:00 - replicationd NamespaceReplicator.cpp(1115) [replicationd] generateNamespace() - Command /hurricane/metatar -c -f /data/hurricane//snfs/tmp/replication/namespace/Keiths/metadata -b /data/hurricane//snfs/tmp/replication/namespace/Keiths/taglist -s /data/hurricane//snfs/tmp/replication/namespace/Keiths/metastatus -w /data/hurricane//snfs/tmp/replication/namespace/Keiths/waitlist -l /data/hurricane//snfs/tmp/replication/namespace/Keiths/barcodes -t part -d /Q/partitions/VL06CX0743BVA00005 -n done, exitcode=0
INFO   - 03/08/11-03:00:00 - replicationd NamespaceReplicator.cpp(1129) [replicationd] generateNamespace() - QLOG_REP_INFO -  Complete replication initiated for Keiths to target: 10.105.13.77
INFO   - 03/08/11-03:00:00 - replicationd RemoteBfstV2.cpp(168) [replicationd] connect() - [TID=1082132832] RemoteBfst::connect '127.0.0.1'
INFO   - 03/08/11-03:00:00 - replicationd RemoteBfstV2.cpp(281) [replicationd] disconnect() - [TID=1082132832] In RemoteBfstV2::disconnect()
INFO   - 03/08/11-03:00:00 - replicationd RemoteBfstV2.cpp(168) [replicationd] connect() - [TID=1082132832] RemoteBfst::connect '127.0.0.1'
INFO   - 03/08/11-03:00:00 - replicationd RemoteBfstV2.cpp(281) [replicationd] disconnect() - [TID=1082132832] In RemoteBfstV2::disconnect()
INFO   - 03/08/11-03:00:01 - replicationd NamespaceReplicator.cpp(1923) [replicationd] sendDataToTarget() - Namespace completed for sync id:
INFO   - 03/08/11-03:00:01 - replicationd DestinationThread.cpp(1666) [replicationd] namespaceReplication() - {1082132832} QLOG_REP_INFO - Successful namespace replication completed for: Keiths to target: 10.105.13.77
[root@DXi75-nh1 ~]#
  

A similar command can be used to list the first 10 matching lines in the log. To do this, use head instead of tail:

 

[root@DXi75-nh1 ~]# grep replicationd /hurricane/tsunami.log |head -10
INFO   - 01/12/11-16:39:31 - procmon ProcMonInvoke.cpp(147) [procmon] Start() - Invoked command '/etc/init.d/replicationd start' (27823)
INFO   - 01/12/11-16:39:31 - procmon ProcMonInvoke.cpp(269) [procmon] Wait() - Command '/etc/init.d/replicationd start' terminated with status 0 in 0 seconds
INFO   - 01/12/11-16:39:31 - replicationd REDaemon.cpp(313) [replicationd] initialize() - Signals blocked
INFO   - 01/12/11-16:39:32 - replicationd REDaemon.cpp(528) [replicationd] doInitialize() - rm cmd is : rm -fr /snfs/tmp/replication/namespace/*
INFO   - 01/12/11-16:39:32 - replicationd REDaemon.cpp(534) [replicationd] doInitialize() - rm cmd is : rm -fr /Q/partitions/namespace_*
INFO   - 01/12/11-16:39:32 - replicationd BfstV2.cpp(67) [replicationd] initialize() - Bfst initialize needed - calling bfst2_initialize
INFO   - 01/12/11-16:39:32 - replicationd REDaemon.cpp(623) [replicationd] startAllChildThreads() - Launching Scheduler Thread
INFO   - 01/12/11-16:39:32 - replicationd REDaemon.cpp(627) [replicationd] startAllChildThreads() - Launching Command Thread
INFO   - 01/12/11-16:39:32 - VpMsg.NameService VpNameService.cc(312) [replicationd] registerName() - Registered NS_TRIGGERD, host=Qnode1, port=60373, pid=27881, ppid=1.
INFO   - 01/12/11-16:39:32 - procmon ProcMonService.cpp(893) [procmon] StartServices() - Service 'replicationd' is started
[root@DXi75-nh1 ~]#
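
Building on the grep filter in Example 3, the following sketch counts how many namespace replications succeeded and how many failed in a log file. It matches the message text seen in the log entries on this page ("Successful namespace replication completed" and "Namespace Replication FAILED for task"). Other software versions may word these messages differently. The function name replication_summary is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: count successful and failed namespace replications
# in a tsunami.log-style file, based on the message text shown on this page.
replication_summary() {
    local log="$1" ok failed
    # grep -c prints the number of matching lines; it exits non-zero when
    # the count is 0, so "|| true" keeps that from being treated as an error.
    ok=$(grep -c 'Successful namespace replication completed' "$log") || true
    failed=$(grep -c 'Namespace Replication FAILED for task' "$log") || true
    echo "successful=$ok failed=$failed"
}

# Example usage:
# replication_summary /hurricane/tsunami.log
```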
 



Notes

Left file content listings in default font. Changing them to Courier made some lines wrap very awkwardly. Bugs in the "change font size" feature kept me from using font size settings to make Courier work.

Note by Ed Winograd on 03/11/2011 05:28 PM

We need an example for a DXi4500, DXi6500, DXi7500, or DXi8500.

Note by Charlotte Taylor on 03/01/2011 01:45 PM

Eclipse example a customer can use:

 

On the target system, check that replication is running. In the GUI, this is under Data Services -> Replication -> Source Role -> Actions. (Yes, Source Role, even though this is the target.)
 If it is running, go to the source system and, under Data Services -> Replication -> Source Role -> VTL (or NAS), click the radio button of any partition so that it is highlighted, then click Check Readiness. This checks whether the two units can communicate with each other across the network. If that is successful, try the Replicate Now button.
 
This should then work, provided that replication is running on the target system. If it does not work and you are sure that everything is correct on the target, reboot the source DXi to restart its replication services.

 

Note by Keith Hatton on 02/16/2011 08:46 AM

