Replication Won't Start (DRAFT)

Overview

In this scenario, the customer has called with this problem: “I can’t start replication for a specific partition.”

Problem Description: I had an SR where partitions on a target system were deleted and then recreated. When they were recreated, they were given new serial numbers, but the source system was still trying to replicate to the target partition using the old serial number, which no longer existed.

Available Information: Tech Support received a specific error message (WHAT WAS THE ERROR MESSAGE?) on the GUI and in the logs.

Tech Support’s Initial Thoughts: Everything is configured properly but the system is still trying to use a serial number that doesn’t exist.

I started reading logs and enabled debug on the processes that I “figured” were involved in creating, deleting, or modifying partitions, as well as the processes related to replication, to find out why the source system was still holding on to the old serial number.

I spent a full day searching the Linter database and other files for a reference to the old serial number that needed to be updated with the new one. Once SUS2 got involved, it took several more days, and I saw in their notes that they were enabling debug for a process I never even realized was used on the system.

Questions from Tech Support: Engineering suggested that I try a reboot. Reboot solved the issue. Should Service/Support have looked at something else in replication files? If so, what and why?

------------------------------------------------------------

Next Steps: Responses from Engineering

First, determine if the customer is using On Demand Replication, Scheduled Replication, or Trigger (Cartridge Based) Replication. Depending on the method of replication, the method to resolve the issue will vary.

The methods described below can be applied to resolve issues with both 1.x and 2.x. Although 2.x does not yet support VTL, this information applies equally well to issues with partition or share replication.

Additional DEBUG Level Logging for the replicationd Process

For additional DEBUG level logging for the replicationd process that controls source side replication, do the following:

Issue the command: mv /var/DXi/processwatcher /var/DXi/processwatcher.orig
Edit the file /hurricane/log-common.conf
- Search for replicationd and change INFO to DEBUG
- Save the changes
Issue the command: /etc/rc.d/init.d/log4cplus-server restart
Issue the command: /etc/rc.d/init.d/replicationd restart
When done, change DEBUG back to INFO, restart both processes again, and move the processwatcher file back to its original name.
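
The edit step above can be scripted. The following is a minimal sketch, assuming the replicationd entry in log-common.conf has the same form as the re_message entry shown in the next section (log4cplus.logger.<name> = LEVEL, APPENDER). The helper name set_log_level is hypothetical and is not a DXi tool.

```shell
#!/bin/sh
# Sketch: switch one logger between levels in a log4cplus config file.
# Assumption: the entry looks like "log4cplus.logger.<name> = INFO, <appender>".
# set_log_level is a hypothetical helper, not part of the DXi software.

# set_log_level CONF LOGGER FROM TO
set_log_level() {
    conf=$1; logger=$2; from=$3; to=$4
    # Keep a one-time backup so the original file is easy to restore.
    [ -f "$conf.bak" ] || cp -p "$conf" "$conf.bak"
    _scratch=$(mktemp) || return 1
    # Only rewrite lines that mention the named logger.
    sed "/$logger/ s/$from/$to/" "$conf" > "$_scratch" && cat "$_scratch" > "$conf"
    rm -f "$_scratch"
}

# Usage on the source system (commands taken from the steps above):
#   mv /var/DXi/processwatcher /var/DXi/processwatcher.orig
#   set_log_level /hurricane/log-common.conf replicationd INFO DEBUG
#   /etc/rc.d/init.d/log4cplus-server restart
#   /etc/rc.d/init.d/replicationd restart
# To revert: swap DEBUG and INFO, restart both services again,
# and move processwatcher.orig back to processwatcher.
```

Writing through a scratch file and copying the contents back keeps the ownership and permissions of the original config file unchanged.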

Additional DEBUG Level Logging for the re_message Process

For additional DEBUG level logging for the re_message process, which controls the target-side receipt of replication status and statistics data, follow the same steps as above, with these exceptions:

Add the following lines to the /hurricane/log-common.conf file:
- log4cplus.logger.re_message = DEBUG, ALL_APP
- log4cplus.additivity.re_message = false
There is no need to restart the replicationd process (re_message is a CGI program that runs each time a CGI request is made).
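
The two re_message entries can be appended with a small script. This is a sketch only: the helper names add_conf_line and enable_re_message_debug are hypothetical, and the additivity key is spelled re_message here so that it matches the logger name. It assumes the config file already ends with a newline.

```shell
#!/bin/sh
# Sketch: append the re_message DEBUG entries to a log4cplus config file,
# skipping any line that is already present so the script can be re-run.
# Both helper names are hypothetical, not part of the DXi software.

# add_conf_line CONF LINE
add_conf_line() {
    conf=$1; line=$2
    # -x matches the whole line, -F treats the text as a fixed string.
    grep -qxF -- "$line" "$conf" || printf '%s\n' "$line" >> "$conf"
}

# enable_re_message_debug CONF
enable_re_message_debug() {
    add_conf_line "$1" 'log4cplus.logger.re_message = DEBUG, ALL_APP'
    add_conf_line "$1" 'log4cplus.additivity.re_message = false'
}

# Usage on the target system:
#   enable_re_message_debug /hurricane/log-common.conf
#   /etc/rc.d/init.d/log4cplus-server restart
```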

If Using On Demand Replication

Use the GUI to verify that a replication target is defined
Use the GUI to verify that replication is not currently paused. If it is paused:

If the ‘Resume’ button is sensitized (enabled), select Resume and retry replication.
If the ‘Resume’ button is not sensitized, the pause is a “Programmatic” pause that the user did not initiate.
- Use ping to ensure the target system is reachable via the IP address or host name used to specify the target system.
  - If it is not reachable by IP address, investigate a target system or network issue
  - If it is not reachable by host name but is reachable by IP address, investigate a DNS issue
- On the target system, ensure that the source system is an allowed source host for replication
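
The ping checks above can be sketched as a single helper that reports which path to investigate. The function name check_target is hypothetical, and the address and host name in the usage comment are placeholders, not values from any real system.

```shell
#!/bin/sh
# Sketch: decide between a network problem and a DNS problem for a target.
# check_target is a hypothetical helper; the values in the usage comment
# are placeholders.

# check_target IP_ADDRESS HOST_NAME
# Returns 0 if both respond, 1 if the IP address fails, 2 if only the name fails.
check_target() {
    ip=$1; host=$2
    if ! ping -c 3 "$ip" >/dev/null 2>&1; then
        echo "not reachable by IP address: investigate target system/network"
        return 1
    fi
    if ! ping -c 3 "$host" >/dev/null 2>&1; then
        echo "reachable by IP address but not by host name: investigate DNS"
        return 2
    fi
    echo "reachable by IP address and by host name"
}

# Usage (placeholder values):
#   check_target 10.0.0.5 dxi-target
```

A successful result only rules out basic reachability. The allowed-source-host setting on the target still has to be checked in the GUI.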

If the replication target has been changed recently and replication is stuck in the ‘Queued’ state, disable replication for the partition to cancel the replication job, then re-enable replication and resubmit the job.

If Using Scheduled Replication

Try to replicate the partition using On Demand Replication to verify that it works. If it does not, follow the investigative paths given above to resolve the issue.
Schedule the replication daily at a time 10 to 15 minutes in the future. If this replication is initiated as expected, reschedule according to your needs.
In the ‘/hurricane/tsunami.log’ file, start at the bottom and search upward for the first occurrence of “Adding new namespace replication task”. Follow the text down from this line to help determine why the replication is not starting.
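
The bottom-up log search can be done from the command line. The following is a minimal sketch using standard grep, tail, cut, and head. The helper name show_from_last_match and the default of 40 lines are arbitrary choices for illustration.

```shell
#!/bin/sh
# Sketch: find the LAST occurrence of a text in a log file and print the
# lines from that point downward. The helper name and the default line
# count are illustrative choices only.

# show_from_last_match FILE TEXT [COUNT]
show_from_last_match() {
    file=$1; text=$2; count=${3:-40}
    # grep -n prefixes each match with its line number; tail keeps the last one.
    start=$(grep -nF -- "$text" "$file" | tail -n 1 | cut -d: -f1)
    if [ -z "$start" ]; then
        echo "no match for: $text" >&2
        return 1
    fi
    # Print the matching line and the lines that follow it.
    tail -n "+$start" "$file" | head -n "$count"
}

# Usage:
#   show_from_last_match /hurricane/tsunami.log \
#       'Adding new namespace replication task'
```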

If Using Trigger Replication (called Cartridge Based Replication in the GUI)

Verification of correct configuration for trigger replication is largely left up to the customer. Check the following to make sure it is configured correctly:
1. Run an On Demand Replication to ensure that the target is reachable and configured to accept replicated data from this source.
2. On the source system, verify that in the Source Role VTL entry for this partition, ‘Enable Replication’ is checked and ‘Enable Cartridge Based Replication to target Sync ID’ is checked. Note: The value of the ‘Sync ID’ must exactly match (including case) the ‘Sync ID’ specified for the partition on the target system that will receive and recover the replicated data.
3. On the target system, verify that in the Target Role VTL Cartridge Based Targets entry for this partition, the State is Enabled, the Sync ID exactly matches the one specified on the source, and the Access is Unlocked.

In the GUI, select the Source Role VTL ‘Cartridge Based Queue’ button to see if the replication request is queued. If it was requeued, there should be text to describe why it did not complete the first time.
Verify that the replication license is installed.
In the ‘/hurricane/tsunami.log’ file, start at the bottom and search upward for the first occurrence of ‘replicationd’, then follow the log upward from there to see if there is information as to why the trigger replication failed.

If the trigger replication succeeded, you should see a message that includes the text ‘Sending notification to triggerd’. If this is not directly followed by a message that contains the text ‘Exiting sendVpMessage’, then the data was successfully replicated to the target.
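
To inspect the message that directly follows the notification line, the most recent occurrence and the next log line can be pulled out together. This sketch uses standard awk. The helper name last_with_next is hypothetical, and the script only prints the two lines; it does not interpret them.

```shell
#!/bin/sh
# Sketch: print the last log line containing a text, plus the line that
# directly follows it. last_with_next is a hypothetical helper name.

# last_with_next FILE TEXT
# Exits non-zero if TEXT never appears in FILE.
last_with_next() {
    awk -v pat="$2" '
        index($0, pat)       { hit = NR; line = $0; after = ""; next }
        hit && NR == hit + 1 { after = $0 }
        END {
            if (!hit) exit 1
            print line
            if (after != "") print after
        }
    ' "$1"
}

# Usage:
#   last_with_next /hurricane/tsunami.log 'Sending notification to triggerd'
```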

If the data was successfully replicated but is not recovered on the target, investigation on the target system is necessary. On the target, select the Target Role VTL ‘Unpack Queue’ button to see if the recovery request is still queued. If it is requeued, there should be text describing why the recovery did not succeed the first time.