Path-To-Tape Failure in NetBackup Environment with SCSI Reservation Conflict Errors (DRAFT)

SR Information: SR 1610340
 

Product / Software Version and Configuration: 2.2.1 on DXi6540, OST 2.6, NDMP Path-To-Tape (PTT), Scalar i6000 tape library. Additional configuration details: The Scalar i6000 library connected to the DXi6540 has 8 tape drives configured for drive sharing between the NetBackup master and media servers and one NetApp filer.
 

Problem Description: PTT stopped working on the NetBackup master and media servers, with SCSI reservation conflict errors.

 

Related PTR: 32483


Overview

A customer reported that PTT, which had previously been working, had stopped working.


Symptoms/Troubleshooting

The DXi logs were filled with errors such as the following:

 

/var/log/messages:

Sep 15 06:29:18 tmwdxihou01 kernel: st3: Error 18 (sugg. bt 0x0, driver bt 0x0, host bt 0x0).

Sep 15 06:29:19 tmwdxihou01 kernel: st 5:0:4:0: reservation conflict

Sep 15 06:29:19 tmwdxihou01 kernel: st3: Error 18 (sugg. bt 0x0, driver bt 0x0, host bt 0x0

...

Sep 14 15:30:35 tmwdxihou01 kernel: st3: Error 18 (sugg. bt 0x0, driver bt 0x0, host bt 0x0).

Sep 14 15:30:36 tmwdxihou01 kernel: st 5:0:3:0: reservation conflict

Sep 14 15:30:36 tmwdxihou01 kernel: st6: Error 18 (sugg. bt 0x0, driver bt 0x0, host bt 0x0).


/var/log/DXi/tsunami.log:

ERROR  - 09/14/13-14:32:58 - ndmp tape.cc(475) [ndmp-23234] ndmpdTapeOpen() - opening tape device: /dev/alias/nst/F0024E4013, Device or resource busy (16).

ERROR  - 09/14/13-15:15:34 - ndmp tape.cc(475) [ndmp-23517] ndmpdTapeOpen() - opening tape device: /dev/alias/nst/F0024E400D, Input/output error (5).

ERROR  - 09/14/13-15:15:36 - ndmp tape.cc(404) [ndmp-23531] ndmpdTapeOpen() - opening tape device: /dev/alias/nst/F0024E400D, Input/output error (5).

ERROR  - 09/14/13-15:15:39 - ndmp tape.cc(475) [ndmp-23534] ndmpdTapeOpen() - opening tape device: /dev/alias/nst/F0024E401F, Input/output error (5).

ERROR  - 09/14/13-15:15:40 - ndmp tape.cc(475) [ndmp-23538] ndmpdTapeOpen() - opening tape device: /dev/alias/nst/F0024E4007, Input/output error (5).

...

ERROR  - 09/14/13-16:49:19 - ndmp config.cc(641) [ndmp-10131] ndmpdConfigGetTapeInfo() - ndmpdConfigGetTapeInfo: scandir of /dev/alias/nst/4:0:3:0 failed(2) - No such file or directory

ERROR  - 09/14/13-16:49:19 - ndmp config.cc(641) [ndmp-10131] ndmpdConfigGetTapeInfo() - ndmpdConfigGetTapeInfo: scandir of /dev/alias/nst/5:0:3:0 failed(2) - No such file or directory

WARN   - 09/14/13-16:49:19 - ndmp config.cc(683) [ndmp-10131] ndmpdConfigGetTapeInfo() - Device /dev/alias/nst/F0024E4001/5:0:4:0 not Configured.

WARN   - 09/14/13-16:49:19 - ndmp config.cc(683) [ndmp-10131] ndmpdConfigGetTapeInfo() - Device /dev/alias/nst/F0024E4007/4:0:3:0 not Configured.
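To see which drives are involved, the reservation conflicts in a messages file can be tallied per SCSI address (host:channel:target:LUN), and the tape device paths that the NDMP log reports as "not Configured" can be listed. The following is a minimal sketch, assuming the log line formats shown above; the function names are hypothetical and are not part of the DXi or NetBackup software. Both functions take the log file path as their only argument.

```shell
#!/usr/bin/env bash
# Sketch only: assumes the log line formats shown in the excerpts above.

# Tally "reservation conflict" kernel messages per SCSI address (H:C:T:L),
# most frequent first. Example matching line:
#   Sep 15 06:29:19 host kernel: st 5:0:4:0: reservation conflict
count_reservation_conflicts() {
    local logfile="$1"
    grep 'reservation conflict' "$logfile" \
        | grep -oE '[0-9]+:[0-9]+:[0-9]+:[0-9]+' \
        | sort | uniq -c | sort -rn
}

# List the tape device paths that the NDMP log reports as "not Configured".
# Example matching text:
#   Device /dev/alias/nst/F0024E4001/5:0:4:0 not Configured.
list_unconfigured_ndmp_devices() {
    local logfile="$1"
    grep -oE 'Device /dev/alias/nst/[^ ]+ not Configured' "$logfile" \
        | awk '{ print $2 }' | sort -u
}
```

For example, run count_reservation_conflicts against /var/log/messages on the DXi or on any NetBackup host, and list_unconfigured_ndmp_devices against the NDMP log shown above.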


The messages files on the NetBackup master and media servers also showed many reservation conflict errors. In addition, running the NetBackup “vmoprcmd” command on any NetBackup host (master or media server) participating in drive sharing showed the status of the tape drives as “PEND-TLD”; one media server also showed some drives with a status of “DOWN-TLD”.
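When several hosts share the drives, it helps to compare the vmoprcmd status from all of them side by side. The following is a minimal sketch, assuming the output of vmoprcmd on each drive-sharing host has been saved to a file named after that host and copied to one place. It only counts lines containing the “PEND-TLD” and “DOWN-TLD” strings and does not depend on any particular vmoprcmd output layout; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: summarize drive states from vmoprcmd output captured on each
# drive-sharing host, one file per host, with the file named after the host.
# Only the status strings are matched; no vmoprcmd column layout is assumed.

report_drive_states() {
    local f host down pend
    for f in "$@"; do
        host="${f##*/}"
        down=$(grep -c 'DOWN-TLD' "$f")       # lines showing DOWN-TLD
        pend=$(grep -c 'PEND-TLD' "$f")       # lines showing PEND-TLD
        printf '%s down=%s pend=%s\n' "$host" "$down" "$pend"
    done
}

# Print only the hosts with at least one DOWN-TLD line: the candidates to
# leave out when the TLD is deleted and reconfigured.
hosts_to_exclude() {
    report_drive_states "$@" \
        | awk '{ split($2, a, "="); if (a[2] + 0 > 0) print $1 }'
}
```

In this incident, such a comparison would be expected to single out the one media server whose drives showed DOWN-TLD.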


Resolution

All servers that accessed drives on the Scalar i6000 library were stopped. All media was unloaded from the tape drives, and all tape drives were then recycled. Special device files that did not have tape drive serial numbers (for example, /dev/alias/nst/4:0:3:0) were deleted and reconfigured. (TS: Is this a summary of what needs to be done below? Or are these pre-requisite steps that need to be done prior to starting step 1 below?)

 

To resolve this issue, do the following: 

 

  1. Remove the invalid device handle for each affected drive (this example uses SCSI address 4:0:3:0):

                echo "scsi remove-single-device 4 0 3 0" > /proc/scsi/scsi

  2. Remove the device files that do not have a drive serial number (S/N):

                rm -rf /dev/alias/nst/4:0:3:0

  3. Rescan the SCSI bus to repopulate the drive device files with serial numbers:

               /opt/DXi/rescan-scsi-bus -r

  4. Stop all servers that access the tape drives.
  5. Scan for PTT devices using the DXi web GUI. (TS: Can we add navigation steps to the GUI page? For example, Configuration > PTT > Physical Device Discovery.)
  6. Reconfigure the tape library through the NetBackup Administration Console, including all the hosts that participate in drive sharing.
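The first three steps above (remove the device handle, remove the device file without an S/N, rescan the bus) must be repeated for every drive whose device file lacks a serial number. The following is a hypothetical helper sketch, not a Quantum-supplied tool. It assumes that such device files are named only by SCSI address (H:C:T:L, as in /dev/alias/nst/4:0:3:0) and that this address maps directly to the host, channel, id, and LUN arguments of remove-single-device. It prints the commands by default and runs them only when APPLY=1 is set, which should be done as root on the DXi and only after the servers that access the drives have been stopped. The GUI scan and the NetBackup reconfiguration remain manual.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a Quantum-supplied tool). Dry run by default:
# the commands are printed, not run. Set APPLY=1 to execute them.

# Print the names of entries in ALIAS_DIR that consist only of a SCSI
# address (H:C:T:L), i.e. device files without a drive serial number.
list_unserialized_aliases() {
    local alias_dir="$1" entry name
    for entry in "$alias_dir"/*; do
        [ -e "$entry" ] || [ -L "$entry" ] || continue   # glob matched nothing
        name="${entry##*/}"
        if [[ "$name" =~ ^[0-9]+:[0-9]+:[0-9]+:[0-9]+$ ]]; then
            printf '%s\n' "$name"
        fi
    done
}

# Run a command, or only print it unless APPLY=1.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Remove the kernel's handle for one drive, given its H:C:T:L address.
remove_scsi_device() {
    local cmd="scsi remove-single-device ${1//:/ }"
    if [ "${APPLY:-0}" = "1" ]; then
        echo "$cmd" > /proc/scsi/scsi
    else
        printf 'echo "%s" > /proc/scsi/scsi\n' "$cmd"
    fi
}

cleanup_unserialized_drives() {
    local alias_dir="${1:-/dev/alias/nst}" hctl
    while IFS= read -r hctl; do
        remove_scsi_device "$hctl"            # remove invalid device handle
        run rm -rf "$alias_dir/$hctl"         # remove device file without S/N
    done < <(list_unserialized_aliases "$alias_dir")
    run /opt/DXi/rescan-scsi-bus -r           # repopulate drives with S/N
}
```

Running cleanup_unserialized_drives with no arguments first, and reviewing the printed commands before repeating it with APPLY=1, keeps the destructive steps under manual control.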

After this configuration, the tape library (TLD) should be configured with all drives in a normal status. However, when backup jobs are run to the tape drives, the drive status changes to PEND-TLD and the backup jobs queue up.

 

  1. Stop any running backup jobs.
  2. Run “vmoprcmd” on each host. The drive status should change to “DOWN-TLD” on the affected media server.
  3. Delete and reconfigure the tape library definition (TLD) in NetBackup, excluding the media server whose drives show a “DOWN-TLD” status. Excluding this media server resolved the problem.

NOTE: The customer later rebooted the media server on the advice of Symantec support, and it rejoined drive sharing. It is believed that the media server with the DOWN-TLD drives was acquiring but not releasing SCSI reservations, which caused the conflicts.
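The suspected cause in the note above can be illustrated with a simplified model of SCSI-2 RESERVE/RELEASE semantics: while one host holds a reservation on a drive, every other host that tries to use that drive gets a reservation conflict, and a RELEASE from any host other than the holder leaves the reservation in place. This is a teaching sketch only (bash 4 or later, for associative arrays); it is not how NetBackup, the DXi, or the tape drive firmware is implemented, and the function and host names are hypothetical.

```shell
#!/usr/bin/env bash
# Simplified teaching model of SCSI-2 RESERVE/RELEASE on shared tape drives.
# Not NetBackup, DXi, or drive firmware code.
declare -A RESERVED_BY=()   # drive address -> host holding the reservation

# reserve DRIVE HOST: succeeds if the drive is free or already held by HOST.
reserve() {
    local drive="$1" host="$2"
    local holder="${RESERVED_BY[$drive]:-}"
    if [ -n "$holder" ] && [ "$holder" != "$host" ]; then
        echo "reservation conflict"
        return 1
    fi
    RESERVED_BY[$drive]="$host"
    echo "ok"
}

# release DRIVE HOST: only the holder's release clears the reservation;
# a release from any other host succeeds but changes nothing.
release() {
    local drive="$1" host="$2"
    if [ "${RESERVED_BY[$drive]:-}" = "$host" ]; then
        unset "RESERVED_BY[$drive]"
    fi
    echo "ok"
}
```

In this model, a media server that calls reserve and never calls release leaves every other host with a reservation conflict on that drive until the reservation is cleared, which matches the effect of rebooting the media server described in the note.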


This page was generated by the BrainKeeper Enterprise Wiki, © 2018