Locating Replication Issues
If the customer is having replication issues, do the following:
### -Replication- 'cat /data/hurricane/replication.conf':
[Global_Global]
CvfsMountPoint=/snfs/Q/
DedupWindowDuration=960
DedupWindowScheduledTimesForSystem=05:00
EncryptionEnabled=false
IsReplicationForSystemEnabled=false
ProgramReplicationIsPaused=false
QbfsMountPoint=/Q/
ReplicationScheduledTimesForSystem=
SourceHostList=10.212.17.27,10.254.246.27
UserReplicationIsPaused=false
[Share_1]
DedupAge=60
DedupEnabled=true
DedupWindowDuration=0
DedupWindowScheduledTimes=
NodeId=
NodeName=AB10_NASTEST
ReplicationDestinations=10.212.17.27
ReplicationEnabled=false
ReplicationRole=3
ReplicationScheduledTimes=
RetentionAge=120
ShareType=0
TriggerReplication=0
TriggerReplicationId=
[Share_2]
DedupAge=60
DedupEnabled=true
DedupWindowDuration=0
DedupWindowScheduledTimes=
NodeId=
NodeName=AB10_NAS002
ReplicationDestinations=10.212.17.27
ReplicationEnabled=true
ReplicationRole=3
ReplicationScheduledTimes=17:00
RetentionAge=120
ShareType=0
TriggerReplication=0
TriggerReplicationId=
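To compare the replication settings of each share without reading the whole file, a small helper along these lines may be useful. It is only a sketch: it assumes the file keeps the INI-style layout shown above, and the function name list_share_replication is hypothetical, not part of the DXi software.

```shell
#!/bin/bash
# Hypothetical helper (not part of the DXi software): summarize the
# replication settings of every [Share_N] section in an INI-style
# replication.conf such as the one shown above.
# Usage: list_share_replication /data/hurricane/replication.conf
list_share_replication() {
    awk -F= '
        # Print the values collected for the section that just ended.
        function emit() {
            if (section ~ /^Share_/)
                printf "%s enabled=%s destinations=%s schedule=%s\n", name, enabled, dest, sched
        }
        # A new [Section] header: report the previous section, then reset.
        /^\[.*\]/ {
            emit()
            section = substr($0, 2, index($0, "]") - 2)
            name = enabled = dest = sched = ""
            next
        }
        $1 == "NodeName"                  { name = $2 }
        $1 == "ReplicationEnabled"        { enabled = $2 }
        $1 == "ReplicationDestinations"   { dest = $2 }
        $1 == "ReplicationScheduledTimes" { sched = $2 }
        END { emit() }
    ' "$1"
}
```

Each share is printed on one line, which makes it easier to spot a share whose ReplicationEnabled value is false or whose schedule is empty.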
Use the ping command to check connectivity in both directions: from source to target and from target to source.
Example: ping 10.212.17.27 (target IP address). Then run the equivalent ping from the target to the source IP address.
Use the telnet command from the source to confirm that the target accepts connections on port 1062 and on port 80:
Example: telnet <target ip> 1062
Example: telnet <target ip> 80
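Where an interactive telnet session is inconvenient, the same port check can be scripted. The sketch below assumes that bash (with its /dev/tcp redirection) and the coreutils timeout command are available on the system, which may not be the case on every DXi software release. The function name check_port is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: test whether a TCP port on a remote host accepts
# connections, as a non-interactive alternative to telnet.
# Assumes bash (for /dev/tcp) and the coreutils timeout command exist.
# Usage: check_port <host> <port>
check_port() {
    local host=$1 port=$2
    # Try to open a TCP connection; give up after 5 seconds.
    if timeout 5 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
        echo "$host:$port reachable"
    else
        echo "$host:$port NOT reachable"
        return 1
    fi
}

# Example, using the target address and ports from the text above:
# check_port 10.212.17.27 1062
# check_port 10.212.17.27 80
```

A non-zero return code means the connection could not be opened within the timeout, which is the same condition telnet would report as a refused or hanging connection.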
Examples of Replication Failures
Example 1
ERROR - 03/30/10-22:02:15 - replicationd DestinationThread.cpp(918) [replicationd] continuousReplication() - {1082132832} Continuous data replication activity has failed to target host: 10.212.17.27
Error details: Network error
Example 2
ERROR - 04/10/10-15:51:03 - replicationd DestinationThread.cpp(1749) [replicationd] namespaceReplication() - {1082132832} QLOG_REP_ERROR - Namespace Replication FAILED for task: destinationId=10.212.17.27, launchTime=Sat Apr 10 15:00:00 2010(1270933200), nodeName=AB10_NAS003, nodeId=, nodeType=Share(0), taskMode=Scheduled(3), nodeDirectory: /Q/shares/AB10_NAS003, eventHostId: AB10-DXI004.Celero.ca, eventBundleUID: 0, eventBundleUID: 130, synchronizeId: DETAILS:
WARN - 04/10/10-15:51:03 - replicationd DestinationThread.cpp(1776) [replicationd] namespaceReplication() - {1082132832} Finalizing Source namespace bundle for FAILED Namespace Replication name: AB10_NAS003 destination: 10.212.17.27 error:
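Entries like the two examples above can be pulled out of a log file with a simple filter. This is a minimal sketch that assumes every entry starts with the severity, timestamp, and process name in the layout shown above. The function name replication_failures is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print only the ERROR and WARN entries that were
# logged by replicationd, assuming the line layout
#   SEVERITY - MM/DD/YY-HH:MM:SS - process source.cpp(line) ...
# Usage: replication_failures /hurricane/tsunami.log
replication_failures() {
    grep -E '^(ERROR|WARN) - .* - replicationd ' "$1"
}
```

Piping the result through tail -10 limits the output to the most recent failures.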
The following section provides some commands that can be used to look for replication issues directly from the DXi system:
From the system
Connect to the DXi using ssh and enter the following commands at the command prompt:
less /hurricane/tsunami.log
tail -f /hurricane/tsunami.log
grep replicationd /hurricane/tsunami.log | tail -10
EXAMPLE OUTPUT
Example 1
This example shows the activity recorded in the tsunami log, including replication activity and error messages. The less command is useful when information from one or more previous months is needed:
INFO - 01/12/11-17:16:39 - replicationd REDaemon.cpp(313) [replicationd] initialize() - Signals blocked
INFO - 01/12/11-17:16:39 - replicationd REDaemon.cpp(528) [replicationd] doInitialize() - rm cmd is : rm -fr /snfs/tmp/replication/namespace/*
INFO - 01/12/11-17:16:39 - replicationd REDaemon.cpp(534) [replicationd] doInitialize() - rm cmd is : rm -fr /Q/partitions/namespace_*
INFO - 01/12/11-17:16:39 - replicationd BfstV2.cpp(67) [replicationd] initialize() - Bfst initialize needed - calling bfst2_initialize
INFO - 01/12/11-17:16:39 - replicationd REDaemon.cpp(623) [replicationd] startAllChildThreads() - Launching Scheduler Thread
INFO - 01/12/11-17:16:39 - replicationd REDaemon.cpp(627) [replicationd] startAllChildThreads() - Launching Command Thread
INFO - 01/12/11-17:16:39 - replicationd REDaemon.cpp(1607) [replicationd] handleProgrammaticPause() - Programmatic Pause: setting paused to false
INFO - 01/12/11-17:16:39 - replicationd ReplicationAPI.cpp(14584) [replicationd] setGlobalReplicationPauseStatus() - System has resumed the replication service1
Press SHIFT-G to jump to the bottom of the file. You can then scroll up and down as needed:
INFO - 03/08/11-13:00:19 - bpgc ReconcileThread.cpp(1327) [bpgc] dumpReplicatedNamespaceTags() - {1082132832} Gathering the tags from all replicated namespaces
INFO - 03/08/11-13:00:19 - bpgc NamespaceBundleUtil.cpp(262) [bpgc] untar() - Successfully untared namespace bundle: /snfs/tmp/replication-expansion1082132832_1299589219/target/DXi5-02nh.quantum.com/shares/keithR2/1/bundle.tar
INFO - 03/08/11-13:00:19 - bpgc NamespaceBundleUtil.cpp(262) [bpgc] untar() - Successfully untared namespace bundle: /snfs/tmp/replication-expansion1082132832_1299589219/target/DXi5-02nh.quantum.com/partitions/KP1/1/bundle.tar
INFO - 03/08/11-13:00:19 - bpgc NamespaceBundleUtil.cpp(262) [bpgc] untar() - Successfully untared namespace bundle: /snfs/tmp/replication-expansion1082132832_1299589219/target/DXi5-02nh.quantum.com/partitions/Keiths/3/bundle.tar
INFO - 03/08/11-13:00:19 - bpgc NamespaceBundleUtil.cpp(262) [bpgc] untar() - Successfully untared namespace bundle: /snfs/tmp/replication-expansion1082132832_1299589219/target/DXi5-02nh.quantum.com/partitions/Keiths/2/bundle.tar
INFO - 03/08/11-13:00:19 - bpgc NamespaceBundleUtil.cpp(262) [bpgc] untar() - Successfully untared namespace bundle: /snfs/tmp/replication-expansion1082132832_1299589219/target/DXi5-02nh.quantum.com/partitions/Keiths/1/bundle.tar
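When only one day of a long log is of interest, filtering by date can be quicker than paging through the file. The following sketch assumes the MM/DD/YY timestamp form seen in the entries above. The function name replication_on_date is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: show the replicationd entries logged on one day.
# The date must use the MM/DD/YY form that appears in the log timestamps.
# Usage: replication_on_date /hurricane/tsunami.log 03/08/11
replication_on_date() {
    local logfile=$1 day=$2
    # First keep lines whose timestamp starts with the given date,
    # then keep only those that mention replicationd.
    grep -F -- " - $day-" "$logfile" | grep -F 'replicationd'
}
```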
Example 2
This command shows the most recent entries in tsunami.log and keeps printing new entries as they are written. It is useful for getting a quick idea of the current status of the system:
[root@DXi75-nh1 ~]# tail -f /hurricane/tsunami.log
INFO - 03/08/11-19:05:09 - hwmon MonitorArrayStatus.cpp(2260) [hwmon] CheckFailureList() - Failure Type List: RecoveryFailureTypeValue::REC_VOLUME_HOT_SPARE_IN_USE.
INFO - 03/08/11-19:05:09 - hwmon MonitorArrayStatus.cpp(2261) [hwmon] CheckFailureList() - Taking no action.
INFO - 03/08/11-19:07:15 - hwmon MonitorArrayStatus.cpp(2260) [hwmon] CheckFailureList() - Failure Type List: RecoveryFailureTypeValue::REC_VOLUME_HOT_SPARE_IN_USE.
INFO - 03/08/11-19:07:15 - hwmon MonitorArrayStatus.cpp(2261) [hwmon] CheckFailureList() - Taking no action.
INFO - 03/08/11-19:07:15 - hwmon perfmon.cpp(527) [hwmon] PerfServerThread() - PerfServerThread: New connection from client '10.17.21.1' socket 22
INFO - 03/08/11-19:07:15 - hwmon perfmon.cpp(774) [hwmon] ReadClientConnection() - Got get request from client '10.17.21.1' flags 0x131
INFO - 03/08/11-19:07:16 - hwmon perfmon.cpp(527) [hwmon] PerfServerThread() - PerfServerThread: New connection from client '10.17.21.1' socket 22
INFO - 03/08/11-19:07:16 - hwmon perfmon.cpp(774) [hwmon] ReadClientConnection() - Got get request from client '10.17.21.1' flags 0x131
INFO - 03/08/11-19:09:21 - hwmon MonitorArrayStatus.cpp(2260) [hwmon] CheckFailureList() - Failure Type List: RecoveryFailureTypeValue::REC_VOLUME_HOT_SPARE_IN_USE.
INFO - 03/08/11-19:09:21 - hwmon MonitorArrayStatus.cpp(2261) [hwmon] CheckFailureList() - Taking no action.
Example 3
This command narrows the search by listing only the last 10 lines in tsunami.log that contain replicationd:
[root@DXi75-nh1 ~]# grep replicationd /hurricane/tsunami.log |tail -10
INFO - 03/07/11-03:00:02 - replicationd DestinationThread.cpp(1666) [replicationd] namespaceReplication() - {1082132832} QLOG_REP_INFO - Successful namespace replication completed for: Keiths to target: 10.105.13.77
INFO - 03/08/11-03:00:00 - replicationd RESchedulerThread.cpp(650) [replicationd] doProcess() - {1166059872} Adding new namespace replication task [destinationId=10.105.13.77, launchTime=Tue Mar 8 03:00:00 2011(1299553200), nodeName=Keiths, nodeId=VL06CX0743BVA00005, nodeType=Partition(1), taskMode=Scheduled(3), nodeDirectory: /Q/partitions/VL06CX0743BVA00005, eventHostId: DXi75-nh1.labs.northampton.uk, eventBundleUID: 0, eventBundleUID: 21, synchronizeId: ]
INFO - 03/08/11-03:00:00 - replicationd NamespaceReplicator.cpp(1115) [replicationd] generateNamespace() - Command /hurricane/metatar -c -f /data/hurricane//snfs/tmp/replication/namespace/Keiths/metadata -b /data/hurricane//snfs/tmp/replication/namespace/Keiths/taglist -s /data/hurricane//snfs/tmp/replication/namespace/Keiths/metastatus -w /data/hurricane//snfs/tmp/replication/namespace/Keiths/waitlist -l /data/hurricane//snfs/tmp/replication/namespace/Keiths/barcodes -t part -d /Q/partitions/VL06CX0743BVA00005 -n done, exitcode=0
INFO - 03/08/11-03:00:00 - replicationd NamespaceReplicator.cpp(1129) [replicationd] generateNamespace() - QLOG_REP_INFO - Complete replication initiated for Keiths to target: 10.105.13.77
INFO - 03/08/11-03:00:00 - replicationd RemoteBfstV2.cpp(168) [replicationd] connect() - [TID=1082132832] RemoteBfst::connect '127.0.0.1'
INFO - 03/08/11-03:00:00 - replicationd RemoteBfstV2.cpp(281) [replicationd] disconnect() - [TID=1082132832] In RemoteBfstV2::disconnect()
INFO - 03/08/11-03:00:00 - replicationd RemoteBfstV2.cpp(168) [replicationd] connect() - [TID=1082132832] RemoteBfst::connect '127.0.0.1'
INFO - 03/08/11-03:00:00 - replicationd RemoteBfstV2.cpp(281) [replicationd] disconnect() - [TID=1082132832] In RemoteBfstV2::disconnect()
INFO - 03/08/11-03:00:01 - replicationd NamespaceReplicator.cpp(1923) [replicationd] sendDataToTarget() - Namespace completed for sync id:
INFO - 03/08/11-03:00:01 - replicationd DestinationThread.cpp(1666) [replicationd] namespaceReplication() - {1082132832} QLOG_REP_INFO - Successful namespace replication completed for: Keiths to target: 10.105.13.77
[root@DXi75-nh1 ~]#
A similar command lists the first 10 matching lines in the log. To do this, use head instead of tail:
[root@DXi75-nh1 ~]# grep replicationd /hurricane/tsunami.log |head -10
INFO - 01/12/11-16:39:31 - procmon ProcMonInvoke.cpp(147) [procmon] Start() - Invoked command '/etc/init.d/replicationd start' (27823)
INFO - 01/12/11-16:39:31 - procmon ProcMonInvoke.cpp(269) [procmon] Wait() - Command '/etc/init.d/replicationd start' terminated with status 0 in 0 seconds
INFO - 01/12/11-16:39:31 - replicationd REDaemon.cpp(313) [replicationd] initialize() - Signals blocked
INFO - 01/12/11-16:39:32 - replicationd REDaemon.cpp(528) [replicationd] doInitialize() - rm cmd is : rm -fr /snfs/tmp/replication/namespace/*
INFO - 01/12/11-16:39:32 - replicationd REDaemon.cpp(534) [replicationd] doInitialize() - rm cmd is : rm -fr /Q/partitions/namespace_*
INFO - 01/12/11-16:39:32 - replicationd BfstV2.cpp(67) [replicationd] initialize() - Bfst initialize needed - calling bfst2_initialize
INFO - 01/12/11-16:39:32 - replicationd REDaemon.cpp(623) [replicationd] startAllChildThreads() - Launching Scheduler Thread
INFO - 01/12/11-16:39:32 - replicationd REDaemon.cpp(627) [replicationd] startAllChildThreads() - Launching Command Thread
INFO - 01/12/11-16:39:32 - VpMsg.NameService VpNameService.cc(312) [replicationd] registerName() - Registered NS_TRIGGERD, host=Qnode1, port=60373, pid=27881, ppid=1.
INFO - 01/12/11-16:39:32 - procmon ProcMonService.cpp(893) [procmon] StartServices() - Service 'replicationd' is started
[root@DXi75-nh1 ~]#
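To see at a glance which shares or partitions have completed namespace replication, the success messages shown in Example 3 can be counted per name and target. This is a sketch only: it relies on the exact message text "Successful namespace replication completed for:" seen in that output, and the function name namespace_success_counts is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: count successful namespace replications per
# share/partition and target, based on the message text
#   "Successful namespace replication completed for: <name> to target: <ip>"
# Usage: namespace_success_counts /hurricane/tsunami.log
namespace_success_counts() {
    awk -F'Successful namespace replication completed for: ' '
        # Lines containing the message split into two fields; the second
        # field is "<name> to target: <ip>".
        NF > 1 { count[$2]++ }
        END { for (k in count) print count[k], k }
    ' "$1" | LC_ALL=C sort -k2
}
```

A share that is scheduled for replication but does not appear in the output has no recorded success in that log file and is a reasonable place to start looking for errors.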