EDLM Troubleshooting
Michael Richter: EDLM Troubleshooting V0.3.9
Scalar i6000 Scan Policies
Enable External Application (StorNext)
– Library checks “suspect count” on dismount
– Scan performed if “suspect count” is set and above the threshold
StorNext threshold
- Only scans it once at the threshold (MEDIA_SUSPECT_THRESHOLD)
- RAS Ticket generated on the library
– If scan fails “medcopy” is started
- Write protect the media on StorNext
- RAS Ticket generated on the library
Tape Alert thresholds
– 3 is the default but configurable
Time interval
– Day increments
– Be careful: it is possible to thrash the system
Scan on import
- Every medium that is entered into the library partition will be scanned, even when it is already known and vaulted
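The StorNext-triggered policy above can be read as a small decision procedure. The following is a minimal sketch of that reading, not the library's actual implementation: the function name, the action strings, the `>=` comparison and the point at which the first RAS ticket is raised are all assumptions made for illustration.

```python
def actions_on_dismount(suspect_count, threshold, scan_passes):
    """Return the ordered actions for one dismount, following the notes above.

    scan_passes is a callable that returns True when the EDLM scan is good.
    Assumption: '>=' is used for the comparison; the notes only say that the
    scan happens once the suspect count reaches MEDIA_SUSPECT_THRESHOLD.
    """
    actions = []
    if suspect_count < threshold:
        return actions                    # below threshold: nothing to do
    actions.append("edlm_scan")
    actions.append("ras_ticket")          # assumed: ticket raised for the scan
    if not scan_passes():
        actions.append("write_protect")   # protect the media on StorNext first
        actions.append("medcopy")         # then copy the data off the media
        actions.append("ras_ticket")
    return actions


print(actions_on_dismount(3, 3, lambda: False))
```

The write protect is placed before the copy because the plug-in notes state that the copy operation is preceded by a media write protect.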
EDLM Setup i6k (example: CGG Veritas)
EDLM Snapi Workflow
EDLM
- The EDLM scan is transparent to StorNext. The i6k library is supposed to report the media as present in its home slot while it scans the media.
SNAPI i6k PlugIn
- test system status to test connectivity
- check for suspect counts of media on dismount
- check for a vault associated with a media export
- initiate a copy operation and precede it with a media write protect
Note: we use pass-through commands to determine the suspect count and the associated vaults, and also for the copy operation, so that both current and previous revisions can be copied.
EDLM
Enabling SNAPI logging won’t show you anything unless there are errors, so the best bet is to capture the commands as they come into SN.
You can get the command and status from the log files listed in the table, but you cannot tell from them whether a command came from a SNAPI call or not.
SNAPI command | SN command | SN log for command & status | Comments
CopyMedia | fsmedcopy -r -b | TSM/logs/history/hist_01 | Supply a volser and block on this call. Not sure if this is still used. At one point there was discussion about using a passthru command to run fsmedcopy -r -a -b so all files get copied instead of just the active ones.
GetMediaStatus | vsmedqry, fsmedinfo | MSM/logs/history/hist_01, TSM/logs/history/hist_01 | We supply a volser and get back a MediaInfo instance. We call getSuspectCount() and getWriteProtected() on this instance after an unload and prior to a CopyMedia command respectively.
GetSystemStatus | cat /usr/adic/.version, TSM_control status, vsping, database_control status, SRVCLOG_control status, cvadmin -e select <fsname> | none | Used to get the SystemInfo instance. We call getSystemVersion() and getState( SystemInfo::system ) on this instance to validate compatibility and online/offline state.
PassThru | vsmedqry | MSM/logs/history/hist_01 | The GetMediaStatus SNAPI command does not return the pending archive field, which the plugin needs to determine the target archive for the move, hence the use of vsmedqry through the pass-through command. Used with the 'vsmedqry' command and a volser. Parse the "Pending Archive" out of the result string.
PassThru | showsysparm | none | Used with the 'showsysparm' command to get FS_MAX_ACTIVE_TAPECOPIES and MEDIA_SUSPECT_THRESHOLD values.
SetMediaState | fschmedstate | TSM/logs/history/hist_01 | Used with the volser of a cartridge and SetMediaState::PROTECT as the state prior to a CopyMedia command.
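Because the history logs do not say whether a command came from SNAPI, it can help to work backwards from the StorNext command seen in the log to the SNAPI calls that could have issued it. The lookup below is only a sketch transcribed from the table above; the dictionary layout and the function name are made up for illustration.

```python
# SNAPI call -> (StorNext commands it runs, history logs to look in).
# Transcribed from the table above; GetSystemStatus has no history log.
SNAPI_MAP = {
    "CopyMedia": (["fsmedcopy"], ["TSM/logs/history/hist_01"]),
    "GetMediaStatus": (
        ["vsmedqry", "fsmedinfo"],
        ["MSM/logs/history/hist_01", "TSM/logs/history/hist_01"],
    ),
    "GetSystemStatus": (
        ["cat", "TSM_control", "vsping", "database_control",
         "SRVCLOG_control", "cvadmin"],
        [],
    ),
    "PassThru": (["vsmedqry", "showsysparm"], ["MSM/logs/history/hist_01"]),
    "SetMediaState": (["fschmedstate"], ["TSM/logs/history/hist_01"]),
}


def candidate_snapi_calls(sn_command):
    """Return the SNAPI calls that may have issued the given StorNext command."""
    return sorted(
        call for call, (commands, _logs) in SNAPI_MAP.items()
        if sn_command in commands
    )


print(candidate_snapi_calls("vsmedqry"))   # two candidates, so the log alone is ambiguous
```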
SNAPI Config / Install
StorNext
/usr/adic/SNAPI/config/snapi.cfg
The snapi.cfg file needs an entry for each MDC, so both appear in it. This has to be done on both MDCs.
---------------------- example--------------------------
<?xml version="1.0"?>
<SNAPI_CONFIG>
<PARAMETER name="serverName" value="10.163.8.222"/>
<PARAMETER name="serverName" value="10.163.8.223"/>
<PARAMETER name="serverPort" value="61776"/>
<PARAMETER name="clientTimeOut" value="1800"/>
</SNAPI_CONFIG>
-----------------------example-------------------------
Note: SNAPI has to be restarted:
SNAPI_control stop
SNAPI_control start
Note: Without both IPs in the config file StorNext will not reissue the request to the other server if appropriate.
Please ensure that SNAPI is running on both MDCs as well.
SNAPI_control status
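A quick way to confirm that both MDC addresses made it into snapi.cfg is to parse the file and compare the serverName entries against the expected IPs. This is a minimal sketch using the example config from above as inline text; the helper names are hypothetical, and on a real system the text would be read from /usr/adic/SNAPI/config/snapi.cfg.

```python
import xml.etree.ElementTree as ET

# Example config copied from the section above.
EXAMPLE_CFG = """<?xml version="1.0"?>
<SNAPI_CONFIG>
<PARAMETER name="serverName" value="10.163.8.222"/>
<PARAMETER name="serverName" value="10.163.8.223"/>
<PARAMETER name="serverPort" value="61776"/>
<PARAMETER name="clientTimeOut" value="1800"/>
</SNAPI_CONFIG>"""


def server_names(cfg_text):
    """Return every serverName value in the order it appears in the config."""
    root = ET.fromstring(cfg_text)
    return [
        param.get("value")
        for param in root.iter("PARAMETER")
        if param.get("name") == "serverName"
    ]


def missing_mdcs(cfg_text, expected_ips):
    """Return the expected MDC addresses that are not listed in the config."""
    present = set(server_names(cfg_text))
    return [ip for ip in expected_ips if ip not in present]


print(missing_mdcs(EXAMPLE_CFG, ["10.163.8.222", "10.163.8.223"]))
```

An empty list means both MDCs are configured; run the same check on each MDC, since the file has to be maintained on both.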
Scalar i6k Troubleshooting ( Example CGG Veritas )
1. General Log Analysis for EDLM activity
TDM_Session.log
serial partition mount_time unmount_time mb_read mb_written motion_time volser
HU1129HE5G (EDLM) 2 2013-04-26 12:59:56+00 2013-04-26 13:01:02+00 116 4 0 ABC755L5
HU1131HRAA 1 2013-04-26 14:22:07+00 2013-04-26 14:26:22+00 116 4 622 ABC755L5
Michael: work/description needed
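Until a proper description exists, the rows above can at least be read mechanically. The sketch below assumes the column layout shown in the header line, that an EDLM drive is flagged by the literal "(EDLM)" token after the serial number, and that both timestamps of a row share the same "+00" offset; the function and key names are hypothetical.

```python
from datetime import datetime


def _parse_ts(date_part, time_part):
    # Drop the "+00" offset suffix; only the difference between the two
    # timestamps of one row is used here.
    return datetime.strptime(f"{date_part} {time_part[:8]}", "%Y-%m-%d %H:%M:%S")


def parse_tdm_session_line(line):
    """Split one TDM_Session.log row into named fields."""
    tokens = line.split()
    edlm = "(EDLM)" in tokens
    if edlm:
        tokens.remove("(EDLM)")
    if len(tokens) != 10:
        raise ValueError(f"unexpected column count: {len(tokens)}")
    mounted = _parse_ts(tokens[2], tokens[3])
    unmounted = _parse_ts(tokens[4], tokens[5])
    return {
        "serial": tokens[0],
        "edlm": edlm,
        "partition": int(tokens[1]),
        "mounted_seconds": int((unmounted - mounted).total_seconds()),
        "mb_read": int(tokens[6]),
        "mb_written": int(tokens[7]),
        "motion_time": int(tokens[8]),
        "volser": tokens[9],
    }


row = parse_tdm_session_line(
    "HU1129HE5G (EDLM) 2 2013-04-26 12:59:56+00 2013-04-26 13:01:02+00 116 4 0 ABC755L5"
)
print(row["edlm"], row["mounted_seconds"], row["volser"])
```

Filtering on the "edlm" field separates scan mounts in the EDLM drive from ordinary host mounts of the same volser.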
lmserver.log
2013-04-26 12:49:30,297 INFO (SnmpMediaBuilder.java:332) User manually inserted media: ABC755L5 at: 48
2013-04-26 12:59:56,858 INFO (ManagementTrapHandler.java:624) Tape Drive mounted: [1, 3, 1, 11, 1, 1] SN: HU1129HE5G
2013-04-26 13:01:07,059 INFO (ManagementTrapHandler.java:650) Tape Drive dismounted: [1, 3, 1, 11, 1, 1] SN: HU1129HE5G
2013-04-26 14:22:08,069 INFO (ManagementTrapHandler.java:624) Tape Drive mounted: [1, 1, 1, 12, 1, 1] SN: HU1131HRAA
Michael: description needed about import/export in regards to EDLM Scan on Import
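To line up insert, mount and dismount events for one tape, the lmserver.log lines above can be reduced to (timestamp, event, subject) tuples. This is a sketch that only covers the three message texts shown in the excerpt; the regular expressions and event names are assumptions, and other message formats are simply skipped.

```python
import re

LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\s+"
    r"(?P<level>\w+)\s+\((?P<src>[^)]+)\)\s+(?P<msg>.*)$"
)

# Event name -> pattern whose first group is the volser or the drive serial.
EVENTS = [
    ("insert", re.compile(r"User manually inserted media: (\S+) at: \d+")),
    ("mount", re.compile(r"Tape Drive mounted: .* SN: (\S+)")),
    ("dismount", re.compile(r"Tape Drive dismounted: .* SN: (\S+)")),
]


def parse_event(line):
    """Return (timestamp, event, subject) or None for unrecognised lines."""
    match = LINE_RE.match(line.strip())
    if not match:
        return None
    for name, pattern in EVENTS:
        hit = pattern.search(match.group("msg"))
        if hit:
            return (match.group("ts"), name, hit.group(1))
    return None


print(parse_event(
    "2013-04-26 12:49:30,297 INFO (SnmpMediaBuilder.java:332) "
    "User manually inserted media: ABC755L5 at: 48"
))
```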
MeDIATables.log
media.session
id | start_date | end_date | num_incomplete | num_good | num_bad | num_suspect | num_unsupported | test_state | continue_on_error
------+------------------------+------------------------+----------------+----------+---------+-------------+-----------------+------------+-------------------
2388 | 2013-04-26 12:57:51+00 | 2013-04-26 13:01:03+00 | 0 | 1 | 0 | 0 | 0 | 1 | f
media.sessionresults
id | aisle_num | frame_num | rack_num | section_num | column_num | row_num | voltag | session_id | media_id | drive_id | last_run_date | test_state | test_type | test_result | static_test_status | static_test_error_code | dynamic_test_status | dynamic_test_error_code | priority
------+-----------+-----------+----------+-------------+------------+---------+----------+------------+----------+------------+------------------------+------------+-----------+-------------+--------------------+------------------------+---------------------+-------------------------+----------
3129 | 0 | 6 | 1 | 4 | 1 | 4 | ABC755L5 | 2388 | 3080 | HU1129HE5G | 2013-04-26 13:00:28+00 | 3 | 1 | 1 | 512 | 768 | 1029 | 1281 | 2
test_state:3 = COMPLETE
test_result:1 = GOOD
static_test_status: 512 = CM_SCAN_COMPLETE,
static_test_error_code: 768 = CM_SCAN_IS_GOOD
Michael: description needed, correct place for mapping information?
//*
test_type:
1 = SIMPLE
2 = NORMAL
3 = FULL
test_state:
1 = PENDING
2 = IN_PROGRESS
3 = COMPLETE
4 = STOPPED
5 = PAUSED
test_result:
0 = NOT_COMPLETED
1 = GOOD
2 = UNSUPPORTED
3 = SUSPECT
4 = BAD
These are the mappings for the numbers in static_test_status, static_test_error_code, dynamic_test_status and dynamic_test_error_code:
1 = TA_READ_WARNING,
3 = TA_HARD_ERROR,
4 = TA_MEDIA,
5 = TA_READ_FAILURE,
10 = TA_NO_REMOVAL,
11 = TA_CLEANING_MEDIA,
12 = TA_UNSUPPORTED_FORMAT,
13 = TA_RECOVERABLE_CARTRIDGE_FAILURE,
15 = TA_CM_FAILURE,
16 = TA_FORCED_EJECT,
18 = TA_TAPE_DIRECTORY_CORRUPTED,
19 = TA_NEARING_MEDIA_LIFE,
20 = TA_CLEANING_REQUESTED,
48 = TA_DIRECTORY_INVALID,
55 = TA_LOAD_FAILURE,
56 = TA_UNRECOVERABLE_UNLOAD_FAILURE,
59 = TA_WORM_INTEGRITY_FAILURE,
60 = TA_WORM_OVERWRITE_ATTEMPTED,
512 = CM_SCAN_COMPLETE,
513 = CM_SCAN_PAUSED,
514 = CM_SCAN_PENDING,
515 = CM_SCAN_NOT_RUN,
516 = CM_SCAN_IN_PROGRESS,
768 = CM_SCAN_IS_GOOD,
769 = CM_SCAN_NA,
770 = CM_SCAN_FAILED_TO_RECIEVE_CM_DATA,
771 = CM_SCAN_CM_HARDWARE_FAILURE,
772 = CM_SCAN_THREAD_COUNT_THRESHOLD_EXCEEDED,
773 = CM_SCAN_WRITE_PASS_THRESHOLD_EXCEEDED,
774 = CM_SCAN_UNCORRECTED_ERRORS,
775 = CM_SCAN_UNABLE_TO_LOAD,
776 = CM_SCAN_UNABLE_TO_UNLOAD,
777 = CM_SCAN_NOT_PRESENT,
778 = CM_SCAN_NO_COMPATIBLE_DRIVE,
1024 = SCAN_COMPLETE,
1025 = SCAN_PAUSED,
1026 = SCAN_PENDING,
1027 = SCAN_NOT_RUN,
1028 = SCAN_IN_PROGRESS,
1029 = SCAN_NOT_CONFIGURED,
1030 = SCAN_STOPPED,
1280 = SCAN_IS_GOOD,
1281 = SCAN_NA,
1282 = SCAN_FAILED_IO_BLADE_COMM,
1283 = SCAN_FAILED_TO_RECIEVE_SCAN_DATA,
1284 = SCAN_UNEXPECTED_EOD,
1285 = SCAN_UNFORMATTED_TAPE,
1286 = SCAN_FAILED_TO_READ_TAPE_DATA,
1287 = SCAN_UNRECOVERED_READ_ERRORS,
1288 = SCAN_PLACE_HOLDER,
1289 = SCAN_CORRUPT_DATA_FORMAT,
1296 = SCAN_MECHANICAL_FAILURE,
1297 = SCAN_SEVERELY_DEGRAGED,
1298 = SCAN_UNABLE_TO_LOAD,
1299 = SCAN_UNABLE_TO_UNLOAD,
1300 = SCAN_CLEANING_CARTRIDGE,
1301 = SCAN_CM_FAULT,
1302 = SCAN_UNKNOWN_MEDIA_TYPE,
1303 = SCAN_SCAN_ABORTED,
1304 = SCAN_MEDIA_NOT_PRESENT,
1305 = SCAN_ENCRYPTED_MEDIA,
1312 = SCAN_BLANK_MEDIA,
1313 = SCAN_BLOCK_SIZE_EXCEEDS_MAX,
1314 = SCAN_FUP_TAPE,
1315 = SCAN_DRIVE_CM_READ_FAIL,
*//
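The mappings above can be applied to a media.sessionresults row to get a readable result. The sketch below uses only an excerpt of the status/error code list (extend it with the remaining codes as needed) and decodes the example row for ABC755L5; the function name and dictionary layout are hypothetical.

```python
TEST_TYPE = {1: "SIMPLE", 2: "NORMAL", 3: "FULL"}
TEST_STATE = {1: "PENDING", 2: "IN_PROGRESS", 3: "COMPLETE", 4: "STOPPED", 5: "PAUSED"}
TEST_RESULT = {0: "NOT_COMPLETED", 1: "GOOD", 2: "UNSUPPORTED", 3: "SUSPECT", 4: "BAD"}

# Excerpt of the status/error code list above, not the full table.
STATUS_CODES = {
    512: "CM_SCAN_COMPLETE",
    768: "CM_SCAN_IS_GOOD",
    772: "CM_SCAN_THREAD_COUNT_THRESHOLD_EXCEEDED",
    1024: "SCAN_COMPLETE",
    1029: "SCAN_NOT_CONFIGURED",
    1280: "SCAN_IS_GOOD",
    1281: "SCAN_NA",
    1287: "SCAN_UNRECOVERED_READ_ERRORS",
}


def decode_result(row):
    """Translate the numeric columns of one sessionresults row into names."""
    def code(value):
        return STATUS_CODES.get(value, f"UNKNOWN({value})")

    return {
        "test_state": TEST_STATE.get(row["test_state"], "UNKNOWN"),
        "test_type": TEST_TYPE.get(row["test_type"], "UNKNOWN"),
        "test_result": TEST_RESULT.get(row["test_result"], "UNKNOWN"),
        "static_test_status": code(row["static_test_status"]),
        "static_test_error_code": code(row["static_test_error_code"]),
        "dynamic_test_status": code(row["dynamic_test_status"]),
        "dynamic_test_error_code": code(row["dynamic_test_error_code"]),
    }


# Values taken from the example row (id 3129, voltag ABC755L5).
example = {
    "test_state": 3, "test_type": 1, "test_result": 1,
    "static_test_status": 512, "static_test_error_code": 768,
    "dynamic_test_status": 1029, "dynamic_test_error_code": 1281,
}
for column, name in decode_result(example).items():
    print(f"{column}: {name}")
```

For the example row the dynamic columns decode to SCAN_NOT_CONFIGURED and SCAN_NA, which matches the SIMPLE test type.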
dbDumpOutput
Michael: work/description needed
2. EDLM\SNAPI related RAS Tickets
Error Code Lookup Tool (ECLT): https://qsweb.quantum.com/users/OPS_site/errorCodeIntro.php
StorNext Troubleshooting
1) SNAPI - force an EDLM scan from within SN by setting high suspect count and mark flag
Example:
echo "use tmdb; update mediadir set Mark='X',Susp='4' where mediaid='AAW914';" | mysql
fsmount <mediaid>
fsdismount -m <mediaid>
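When this has to be repeated for several media, the three steps can be generated instead of typed. The helper below only builds the command strings from the example above and executes nothing; the function name and the media ID check (six upper-case letters or digits) are assumptions for illustration, and the command flags are copied as written in this page.

```python
import re


def force_scan_steps(media_id, suspect_count=4):
    """Return the shell commands from the example above for one media ID."""
    # Assumed format check: reject anything that could break out of the SQL quoting.
    if not re.fullmatch(r"[A-Z0-9]{6}", media_id):
        raise ValueError(f"unexpected media ID: {media_id!r}")
    sql = (
        "use tmdb; update mediadir set Mark='X',"
        f"Susp='{int(suspect_count)}' where mediaid='{media_id}';"
    )
    return [
        f'echo "{sql}" | mysql',
        f"fsmount {media_id}",
        f"fsdismount -m {media_id}",
    ]


for step in force_scan_steps("AAW914"):
    print(step)
```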
2.) SNAPI - force the AEL to send a SNAPI request to initiate a fsmedcopy
The default value of the Thread Count Threshold is 9900.
Changing the EDLM Thread Count Threshold to 1 causes the media to be marked suspect during the EDLM scan.
1. ssh to library IP
2. login as ilinkacc (password is the admin password)
3. su (password is dallas)
• This value can be adjusted by:
psql -Uilinkacc i2kdb -c "INSERT INTO media.settings VALUES(DEFAULT, 'MeDIA_ThreadCountThreshold', <n> )"
• Update it with:
psql -Uilinkacc i2kdb -c "UPDATE media.settings SET value = <n> WHERE name = 'MeDIA_ThreadCountThreshold'"
• Delete it and return to the default:
psql -Uilinkacc i2kdb -c "DELETE FROM media.settings WHERE name = 'MeDIA_ThreadCountThreshold'"
Where <n> is any signed 32 bit value. If you set it to -1, all tapes would fail; 1 would fail any tape that has been threaded at least once, etc.
For these to take effect:
• Reboot the library after changing this, or just restart tmmd:
/etc/init.d/tmmd restart
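The two examples given for <n> (-1 fails all tapes, 1 fails any tape threaded at least once) are consistent with a "thread count greater than or equal to threshold" comparison. The sketch below encodes that reading together with the signed 32-bit range; the comparison operator is inferred from those examples rather than confirmed, and the function names are hypothetical.

```python
DEFAULT_THREAD_COUNT_THRESHOLD = 9900  # default stated in the notes above
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def validate_threshold(value):
    """Reject values that do not fit a signed 32-bit integer."""
    if not INT32_MIN <= value <= INT32_MAX:
        raise ValueError(f"threshold out of signed 32-bit range: {value}")
    return value


def fails_thread_count_check(thread_count, threshold=DEFAULT_THREAD_COUNT_THRESHOLD):
    # Assumption: '>=' is inferred from the two documented examples, not
    # taken from the library code.
    return thread_count >= validate_threshold(threshold)


print(fails_thread_count_check(0, -1))   # any tape fails with -1
print(fails_thread_count_check(1, 1))    # threaded once fails with 1
print(fails_thread_count_check(500))     # well below the default of 9900
```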
Checking Current Thread Count of a Tape
• To determine the current thread count of a tape that has been loaded in a drive at least once in the library (003175L3 in this case, note the % at the end):
psql -Uilinkacc i2kdb -c "SELECT vol_tag, thread_count from library.mediastats WHERE vol_tag LIKE '003175L3%'"
Use Cases / Service Request
SR 1560792 - CGG Veritas - QFE replaced the MCB on the CGG Veritas library, restored an old config from April 2013, then upgraded the MCB from i8.1 -> i10.3
Application Client Plug-In is inoperable
"Application Client Plug-In is inoperable"
R2, H1, Degraded: D2350, RQ=0, By TapeDriveMgr @ Tue May 14 12:34:53 2013
K/C/Q = "No Sense: No Additional Sense Information", SN = ""
Tag: 01_35_07_00_00000000 Error Modifier:0x0
Desc"Failed to load the EDLM extension: snapi-2.0.1"
Since we didn't make any changes on the StorNext side, it had to be an issue on the i6k.
The i6k was showing the correct configuration information for the SNAPI plug-in and the EDLM config.
The ticket itself is pretty clear: it cannot load the EDLM extension/plug-in.
A quick check of the i6k repair pages also suggests reloading and re-configuring the plug-in.
This seems to be a known bug (Bug 33197 - EDLM Extension load failure): the save/restore configuration doesn't store the extension/plug-in, so it has to be re-installed.
SR 1554934 - CGGVERITAS - EDLM Scan on Import causes TSM/MSM to run into inconsistency
Descr: Enabling Scan on Import and moving tapes from the vault to the library leaves SNSM with an inconsistency for media on the EDLM scan list
Analysis:
i6k lmserver.log - medium physically inserted into the library -
2013-04-26 12:49:30,297 INFO (SnmpMediaBuilder.java:332) User manually inserted media: ABC755L5 at: 48
i6k TDMSession.log - reflects the EDLM scan interval
serial partition mount_time unmount_time mb_read mb_written motion_time volser
HU1129HE5G (EDLM) 2 2013-04-26 12:59:56+00 2013-04-26 13:01:02+00 116 4 0 ABC755L5
StorNext MSM Tac Log
Apr 26 12:55:43 redmeta01.uk.cgg.com snmsm ArcDisp[13444]: E0999(7)<1111641345>:AD_ArcEnterCmd574: ABC755: action mount move
Apr 26 12:55:43 redmeta01.uk.cgg.com snmsm XdiAMTask_3[13449]: E0462(6)<1111635229>:AMTaskFuncs1410: Archive i6k_archive1: Received Mount command (Media ID: ABC755, Drive List)
Apr 26 12:55:43 redmeta01.uk.cgg.com snmsm XdiAMTask_3[13449]: E0999(7)<1111635229>:MountAMCmd1645: MountAMCmd::SelectResources DriveID:18 MediaID:ABC755
Apr 26 12:55:43 redmeta01.uk.cgg.com snmsm XdiAMTask_3[13449]: E0416(6)<1111635229>:MountAMCmd2546: Archive i6k_archive1: Medium: ABC755 and Drive: 18 selected to be mounted
Apr 26 12:55:53 redmeta01.uk.cgg.com snmsm XdiAMTask_3[13449]: E0714(4)<1111635229>:XdiPrimitive1958: Archive i6k_archive1: MOUNT of Medium ABC755 for Drive Slot 00000,00000,00012,00261 was Failed (304 - medium not found in archive)
Apr 26 12:55:53 redmeta01.uk.cgg.com snmsm XdiAMTask_3[13449]: E0999(7)<1111635229>:ArchMedia457: ArchMedia::ArchMedia(ABC755) 'Database'
Apr 26 12:55:53 redmeta01.uk.cgg.com snmsm XdiAMTask_3[13449]: E0104(3)<1111635229>:MountAMCmd3046: Archive i6k_archive1: INCONSISTENCY EXISTS: Mount - Assigned Medium: ABC755 not found - Error: 40
Apr 26 12:55:53 redmeta01.uk.cgg.com snmsm XdiAMTask_3[13449]: E0138(6)<1111635229>:MountAMCmd3586: Archive i6k_archive1: Mount command failed for (Media ID: ABC755, Drive List) - Error: media not available for use
You can see the media was imported into the library partition and shortly afterwards queued for EDLM Scan on Import.
In theory, the i6k is expected to report the home slot for the media during an EDLM scan, which makes the scan transparent to MSM and delays the mount request until the scan has finished.
In this scenario the i6k didn't report the home slot to the application, and the application lost track of the tape, ending up with an inconsistency.
This has been escalated and addressed in Automation Bug 48016 - Import quick scan usurps SN media
SR 3355462 - Skyvision - forcing an EDLM scan via StorNext doesn't work
Description: Support tried to force an EDLM scan by setting the suspect count to the threshold, which did not trigger an EDLM scan
Analysis:
The i6k library had the "Media Identifier" set to "pass through" rather than "disabled".
This caused EDLM to send the full barcode in the command to SNAPI, which resulted in:
fsmedinfo A00007L6
SNSM did not know a media with this ID, so the request failed.
Modified the i6k configuration so that "Media Identifier" was set to "disabled"; SNSM was stopped before modifying the library.
Once everything was restarted the test worked and the media was mounted in the EDLM drive and tested.
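The mismatch is easy to spot once the barcode is split into its volser and media identifier parts. The sketch below assumes the common LTO label layout of a six-character volser followed by a two-character media identifier starting with "L" (as in A00007L6 and ABC755L5 on this page); the function name is hypothetical.

```python
import re

# Assumed label layout: 6-character volser + 2-character media identifier.
_BARCODE_RE = re.compile(r"^(?P<volser>[A-Z0-9]{6})(?P<media>L[0-9A-Z])$")


def split_barcode(barcode):
    """Return (volser, media_identifier); the identifier is '' if absent."""
    match = _BARCODE_RE.match(barcode)
    if match:
        return match.group("volser"), match.group("media")
    return barcode, ""


print(split_barcode("A00007L6"))
```

If the ID that reaches fsmedinfo still carries a media identifier, the library is sending the full barcode and the "Media Identifier" setting should be checked.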
EDLM Related Bugs
Bug 41930 - SNAPI: GetSystemStatus returns SUBFAILURE if the fsmlist file has any commented out lines in it
Bug 48016 - Import quick scan usurps SN media
Bug 33197 - EDLM Extension load failure
Bug 37274 - provide interface for EDLM to know which barcode format to use.
Bug 63132 - Successfull EDLM Scan should reset Media suspect count
Bug 56435 - EDLM scan can trigger multiple unwanted media check
Edit:
17.05.2017 - added 2 more SNMS Bugs related to EDLM
ToDo
- SNAPI PlugIn i6k Config/install + Test
- Test Snapi Communication between Library and SN
- restructure wiki page