How to Investigate Continuous File Issues on DXi (DRAFT)
Replication of backup data from a Source DXi to a Target DXi has a few trigger paths. These paths are disconnected from the actual files stored by the customer; they operate on the data storage elements on the DXi, which are denoted as tags. Replication uses these tag lists to control race conditions between disparate processes: space reclamation (delete), replication, and healthcheck, to name a few.
Transferring a /snfs/replication/source/srchost/partitions/PART_NAME/continuous file from the Source to the Target DXi generates a HoldLinks tag file to protect BLOBs from the space reclamation process. In essence, these tags identify data in flight from Source to Target. Replication can take time, so you should perform a sanity check on the continuous tag list and on system resources. Perform the investigation outlined in this topic before escalating to Engineering; collect the data from both DXis, and yes, this must be escalated to Engineering for review.
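For orientation, these are the paths this investigation touches (srchost, HOST.com, and PART_NAME stand in for the actual source host, target host, and partition names on your systems):
>> /snfs/replication/source/srchost/partitions/PART_NAME/continuous   (Source-side tag list)
>> /snfs/replication/target/HOST.com/partitions/PART_NAME/continuous  (Target-side tag list)
>> /snfs/tmp/HoldLinks/HoldLink<BLOBstring>                           (hold lists protecting in-flight BLOBs)
>> /data/PrepostTagHoldList                                           (active holds on the Target)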
Taken from a field issue: a DXi started as a 56 TB capacity system and showed 50 TB used, with 800 GB available. The Target was filling up. After several space reclamation runs, the Target system was nearly full and was not deleting any tags from the system.
Looking at the blockpool size on both the Target and the Source is a good starting point. Why? We need to know the approximate size of the data pools that could be transferred. Because a DXi can be reconfigured over time, clean-ups may not happen correctly, and that should set off alarm bells. In this example, the Target DXi has a larger blockpool than the Source. This can be legitimate for several reasons (the customer may be holding onto older data sets, or the Target might be performing ingest, to name a few possibilities). Look at ./app-info/BPQuickReport, as shown in the example below.
>>Blockpool data size (BLOBs) Target= 2,830,186
>>Blockpool data size (BLOBs) Source= 1,647,864
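A quick way to pull this number on each system is to search the report for the blockpool data size line. The exact label can vary by DXi release, so treat the pattern below as a sketch and adjust it to match your report:
>> []# grep -i "blockpool data size" ./app-info/BPQuickReport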
With these possibilities reviewed with the customer, the next step is to examine the /snfs/tmp directory for the HoldLinks directory and its HoldLink<BLOBstring> files, looking for large or old tag list files; a quick way to find these is sketched below. Once the old HoldLinks have been cleaned up, look next at /snfs/replication/target/HOST.com/partitions/
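For example, to list HoldLink files newest-first and flag unusually large or stale ones (the size and age thresholds here are illustrative, not official limits):
>> []# ls -lt /snfs/tmp/HoldLinks/
>> []# find /snfs/tmp/HoldLinks -name 'HoldLink*' -size +50M -mtime +2 -ls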
This target now has a hold for 2.9 million tags.
>> []# pwd
>> /snfs/replication/target/HOST.com/partitions/dxi-vtl01
>> [root@nydux-dxi03 dxi01-vtl01]# wc -l continuous
>> 2973859 continuous
Now, it is possible that not all the tags have been replicated over from Source to Target, so a tag file of 2.9 million entries, larger than the blockpool size on the Target, is plausible. But the Source only has 1.6 million tags.
Looking in /snfs/tmp for HoldLink files, this one looks large: 98137347 bytes / 33 bytes per tag = 2973859 tags.
-rw-r--r-- 2 root root 98137347 Oct 22 17:56 HoldLink4e87364507d0b28e9bcc9f9d32fd11d3
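The 33 bytes per entry works out to a 32-character tag string plus a newline, which also matches the length of the <BLOBstring> suffix in the file name. Assuming that layout holds on your system, you can estimate the tag count of every HoldLink file in one pass:
>> []# ls -l /snfs/tmp/HoldLinks/HoldLink* | awk '{printf "%s ~%d tags\n", $NF, $5/33}'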
Next, follow this to the Source DXi and look at what /snfs/replication/source shows. Then monitor growth, re-enable replication, and start a snapshot. See below.
>> [source]# pwd
>> /snfs/replication/source/srchost/partitions/dxi-vtl01
>> -rw-rw-r-- 1 root root 99785433 Oct 25 12:40 /scratch/continuous
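One simple way to watch the continuous file grow over time is a loop like the following, run from the partition directory on the Source (the 5-minute interval is illustrative):
>> [source]# while true; do date; ls -l continuous; wc -l continuous; sleep 300; done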
After the customer enabled replication on the Target system, the Source started replicating. BPGC cleanup of the old image 245 was required, starting a new continuous list under image 246.
>> [target]# cat /data/PrepostTagHoldList
/snfs/replication/target/pscux-dxi01.Lazard.com/partitions/dxi01-vtl01/continuous.backup,/snfs/tmp/HoldLinks/HoldLinkb7ec03b49d442c75f792bc95f3e75704,2525370991,252537107
/snfs/replication/target/pscux-dxi01.Lazard.com/partitions/dxi01-vtl01/continuous,/snfs/tmp/HoldLinks/HoldLink4e87364507d0b28e9bcc9f9d32fd11d3,2525371077,18446744073709551615
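Note that the last field on the second entry, 18446744073709551615, is 2^64-1 (UINT64_MAX); this may indicate an unset or unbounded end value for that hold, though Engineering should confirm the field's meaning.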
>> INFO - 10/25/13-13:29:39 - bpgc NamespaceManager.cpp(504) [bpgc] cleanupOldNamespacesOfNode() - cleanupOldNamespacesOfNode: crossed maxSavedReps, removing the bundle LocationType: source
Host: srchost
NodeType: partitions
Name: dxi-vtl01
UID: 245
Tag:
Zone: /snfs/replication/source/srchost/partitions/dxi-vtl01/245
Temp: /snfs/tmp/replication-expansion1085356352_1382722179/source/srchost/partitions/dxi-vtl01/245
Rtvd: 0
total 192
drwxrwxrwx 2 root root 2068 Oct 25 13:24 246
-rw-rw-rw- 1 root root 6204 Oct 25 13:30 continuous
-rw-rw-r-- 1 root root 0 Oct 22 17:16 ###_ENABLED_NEXTTASK_###
-rw-rw-r-- 1 root root 4 Oct 25 13:24 oneup
This example shows that replication from the Source system requested a hold on 2.9 million tags. After deleting this file and restarting replication, we then found a replacement HoldLinks file on the Target with 3.2 million tags on image 245. So the Target cannot delete any tags, because the Source system is saying to hold onto this list of tags. Remember that the Source system has only 1.6 million tags in the blockpool. Deleting the old continuous file on the Source system and then starting another snapshot resolved the issue of the Source running the Target out of space.
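If Engineering directs you to clear a stale list, a safer variant of the fix above is to move the file aside rather than delete it outright, so it can be recovered or attached to the escalation. A sketch, using the paths from this example (confirm with Engineering before touching customer data):
>> [source]# cd /snfs/replication/source/srchost/partitions/dxi-vtl01
>> [source]# mv continuous continuous.stale.$(date +%Y%m%d)
Then re-enable replication and start a new snapshot as described above.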
Reference PTR: 31733, 31605, 28477, 16597