Phase 2: Analysis

Analysis is the second phase in the troubleshooting process. In this phase, you need to get a solid understanding of the problem. You cannot successfully carry out the third phase, Implementation, until you completely understand the problem.

The Analysis phase is composed of the following steps:

Identify all possible causes.
Identify the most likely causes.
- Apply falsification to likely causes.
- Consider the root cause.
- Rank the likely causes.
Test all possible causes.

Identify All Possible Causes

Identify as many possible causes of the problem as you can. Be disciplined: think through each aspect of the issue you are troubleshooting. Determine whether any possible cause can be ruled out based on the information you have gathered.¹ For example, if someone tells you that their television picture is grainy and choppy, you can immediately rule out "no power" as a possible cause.

Using the resources at your disposal, determine whether any of the information that you gathered in the Investigation phase points to a known issue. Product documentation and Quantum information sources (such as TSBs, CSWeb, the Knowledge Base, and Qwikipedia) can be helpful.

If the problem does not match a known issue and you work in a team environment, it may help to collaborate with your peers to identify possible causes. They may have additional ideas, or might approach the problem from a different perspective. It's hard to think of everything, so keep an open mind about your peers' ideas.

Example

Using the car example, a list of possible reasons why the car would not start might be:

The ignition switch may be faulty.
The battery may be faulty.
The alternator may be faulty.
The starter may be faulty.

Identify the Most Likely Causes

Consider how likely each potential cause is. Do not eliminate a possible cause until you have conclusively disproved it.¹

Apply Falsification

Apply falsification to eliminate possible causes. The idea behind using falsification is to treat your initial conclusions about a complex troubleshooting problem as being untrustworthy. Determine what evidence disproves a possible cause, rather than looking for something to confirm what you think might have caused the problem.

By disproving a possible cause, you can save a lot of time. You can discard that cause and move on to the next one. If you cannot find any evidence that a possible cause is wrong, you will have more confidence that you may be on the right track.²

Using the car example mentioned above, you could perform falsification testing on the ignition switch, battery, and alternator. Here are some things you might find.

Ignition switch: The fact that the customer heard a "clicking" noise when they tried to start the car indicates that the ignition switch is working. You can rule this out as a possible cause.
Battery: Check the battery cables for corrosion, since this would keep power from flowing freely to the starting system. Let's say that the cables are fine and not corroded. You can then test the lights, windshield wipers, and radio. If they all work fine, you can rule out a bad battery as a possible cause.
Alternator: In this case, say that there are no problems with dim headlights or reduced radio power. The windshield wipers, headlights, and radio all work fine. You can rule out alternator problems as a possible cause.
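
The falsification pass above can be sketched in a few lines of Python. This is only an illustration, not a diagnostic tool: the `Cause` class, the `falsify` function, and the observation labels are hypothetical names chosen for this sketch, and the observations are a simplified version of the car example.

```python
from dataclasses import dataclass, field


@dataclass
class Cause:
    name: str
    # Observations that, if seen, would disprove this cause.
    # The labels are hypothetical and only need to match the strings below.
    disproved_by: set = field(default_factory=set)


def falsify(causes, observations):
    """Return the causes that none of the observations disproves."""
    return [c for c in causes if not (c.disproved_by & observations)]


causes = [
    Cause("ignition switch", {"clicking noise on start"}),
    Cause("battery", {"lights, wipers, and radio work"}),
    Cause("alternator", {"headlights not dim"}),
    Cause("starter"),  # nothing observed so far disproves this one
]

observations = {
    "clicking noise on start",
    "lights, wipers, and radio work",
    "headlights not dim",
}

remaining = falsify(causes, observations)
print([c.name for c in remaining])  # ['starter']
```

Note that the sketch looks only for evidence that disproves a cause. A cause with no disproving evidence stays on the list; it is not thereby confirmed.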

Consider the Root Cause

In addition to applying falsification, make finding the root cause your goal when identifying the most likely causes. Root cause analysis rests on the belief that problems are best solved by correcting or eliminating their root causes, rather than merely addressing the immediately obvious symptoms. Directing corrective measures at root causes makes it more likely that the problem will not recur.³

For most problems, you can get to the root cause by drilling into proposed explanations, repeatedly asking "Why?" The "5 Whys" method was developed by the Toyota Motor Corporation. It is based on the observation that five iterations of asking "Why?" are usually enough to reach the root cause of most real-world problems. The answer to each "Why?" adds to the overall picture and brings you closer to the root cause. For example:⁴

Problem Statement: The system crashed. (Why?)
A memory chip failed. (Why?)
The machine room temperature exceeds recommendations. (Why?)
The HVAC unit is undersized, given our heat load. (Why?)
Our projections for heat load were lower than what has been observed. (Why?)

Root cause: We did the heat-load projections ourselves, rather than bringing in a qualified expert.
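
As a rough sketch, the chain above can be written down as data and walked step by step. The `five_whys` function is a hypothetical helper written for this illustration; the statements are taken from the example, and the limit of five is the rule of thumb, not a hard requirement.

```python
def five_whys(problem, answers, limit=5):
    """Print each statement with the answer to "Why?" and return the last answer.

    `answers` holds the successive answers, in order. At most `limit`
    of them are used, following the "5 Whys" rule of thumb.
    """
    chain = [problem] + list(answers[:limit])
    for statement, reason in zip(chain, chain[1:]):
        print(f"{statement} Why? -> {reason}")
    return chain[-1]


answers = [
    "A memory chip failed.",
    "The machine room temperature exceeds recommendations.",
    "The HVAC unit is undersized, given our heat load.",
    "Our projections for heat load were lower than what has been observed.",
    "We did the heat-load projections ourselves, rather than bringing in a qualified expert.",
]

root_cause = five_whys("The system crashed.", answers)
print("Root cause:", root_cause)
```

Writing the chain down this way makes it easy to see whether any link is an assumption rather than an observed fact.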

Rank Likely Causes

After performing falsification and considering the root cause, rank the remaining likely causes, from most likely to least likely.

In the car example, falsification ruled out the other three possible causes, so the only likely cause left is a faulty starter. This means that you can test this cause in the next step. If more than one possible cause had remained, you would need to rank them in order of probability.

Test All Possible Causes

As mentioned earlier, check the available documentation, such as user guides, service manuals, and other Quantum resources. These often include recommendations about how to test. Sometimes there are built-in testing facilities, and sometimes there are hardware-specific issues to consider, which may be covered in the documentation or other resources.

If possible, always back up data before testing. Then test the remaining causes, starting with the most likely and using the least disruptive method possible. If non-disruptive tests are available, always start with those. Follow up with less likely causes.
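
One way to picture this ordering is as a sort with two keys: non-disruptive tests first, then higher likelihood first within each group. The sketch below assumes that reading of the guidance. The `PlannedTest` class, the `order_tests` function, the causes, and the likelihood values are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PlannedTest:
    cause: str
    likelihood: float  # rough estimate between 0 and 1 (made-up values)
    disruptive: bool   # True if running the test interrupts service


def order_tests(plans):
    """Order tests: non-disruptive first, then most likely first."""
    # False sorts before True, so non-disruptive tests come first.
    # Negating the likelihood puts higher values earlier within each group.
    return sorted(plans, key=lambda p: (p.disruptive, -p.likelihood))


plans = [
    PlannedTest("bad memory module", likelihood=0.6, disruptive=True),
    PlannedTest("full disk", likelihood=0.3, disruptive=False),
    PlannedTest("misconfigured service", likelihood=0.5, disruptive=False),
]

for plan in order_tests(plans):
    print(plan.cause)
# misconfigured service
# full disk
# bad memory module
```

In practice the trade-off is a judgment call. A highly likely cause with a mildly disruptive test may still deserve to go ahead of an unlikely cause with a non-disruptive one.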

Depending on the situation, it may even be appropriate to test the likely cause by directly applying a recommended fix for that problem. If you do this, always apply only one fix at a time. If the fix fails to solve the problem, remove it (back out of it) before you test the next fix. Otherwise, applying multiple fixes may keep you from getting a good handle on the root cause of the problem.¹
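
The "one fix at a time, back out on failure" rule can be sketched as a simple loop. Everything here is simulated: `try_fixes`, `make_fix`, the fix names, and the `applied` set standing in for system state are hypothetical, and a real back-out procedure depends on the product involved.

```python
def try_fixes(fixes, problem_solved):
    """Apply fixes one at a time and back out each fix that does not help.

    `fixes` is a sequence of (name, apply, back_out) tuples.
    Returns the name of the fix that solved the problem, or None.
    """
    for name, apply, back_out in fixes:
        apply()
        if problem_solved():
            return name
        back_out()  # restore the previous state before trying the next fix
    return None


applied = set()  # simulated system state: the fixes currently in place


def make_fix(name):
    """Build a simulated fix that records itself in `applied`."""
    return (name, lambda: applied.add(name), lambda: applied.discard(name))


fixes = [make_fix("fix A"), make_fix("fix B"), make_fix("fix C")]

# In this simulation, only "fix B" resolves the problem.
solved_by = try_fixes(fixes, lambda: "fix B" in applied)
print(solved_by, applied)  # fix B {'fix B'}
```

Because "fix A" was backed out before "fix B" was applied, only the fix that solved the problem remains in place, so it is clear which change made the difference.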

Remember, it is important to emerge from the Analysis phase with a solid understanding of the problem. Do not move ahead to the Implementation phase before you understand the issue at hand and the possible reasons for the problem.

Example

A bad memory module was identified as a possible cause of a server problem. The vendor's product documentation contained step-by-step procedures for verifying whether the memory module was faulty, and it included procedures for properly removing the old module, installing a new module, and testing the new module. Starting with these documented procedures is clearly a good idea.

What's Next?

Phase 3: Implementation >

Additional Resources

Falsifiability page on Wikipedia
Root cause analysis page on Wikipedia

References

Cromar, Scott. "Troubleshooting Methodology," Princeton University Enterprise Servers and Storage, 2007; available from http://www.princeton.edu/~unix/Solaris/troubleshoot/methodology.html; accessed on January 4, 2011.
Thomas, Orin. "Falsification as a Troubleshooting Methodology," Windows IT Pro, January 11, 2010; available from http://www.windowsitpro.com/article/systems-administrator/falsification-as-a-troubleshooting-methodology.aspx; accessed on January 4, 2011.
"Root cause analysis," Wikipedia, February 10, 2011; available from http://en.wikipedia.org/wiki/Root_cause_analysis; accessed on February 15, 2011.
Cromar, Scott. "Root Cause Analysis," Princeton University Enterprise Servers and Storage, 2007; available from http://www.princeton.edu/~unix/Solaris/troubleshoot/rca.html; accessed on January 4, 2011.

Good point, Ed. I've introduced RCA in the Analysis phase when identifying the most likely causes.

Note by Tom Sajbel on 02/15/2011 12:26 PM

Shouldn't Root Cause Analysis be mentioned in this article? The page on RCA referred to in the next article (Implementation), in its section on "General principles of root cause analysis," second point, says:

"2. To be effective, RCA must be performed systematically, usually as part of an investigation, with conclusions and root causes identified backed up by documented evidence. Usually a team effort is required." (my emphasis)

RCA could be used again in the Implementation phase, but I would think it needs to be part of the Analysis phase, as well.

Note by Ed Winograd on 02/14/2011 03:50 PM

I added a reference to using available documentation, such as product docs and Quantum-specific documentation (CSWeb, TSBs, KB), in addition to the product documentation.

Note by Ed Winograd on 02/14/2011 03:49 PM

Allan, good point about "chasing a wild goose." I doubt it has the same meaning around the world. I rewrote that section.

Note by Tom Sajbel on 02/11/2011 10:52 AM

Not real sure about the "chasing a wild goose" statement. It kind of sounds like a bad translation. Would this statement be applicable to our APAC & EMEA groups? Would it translate well enough for them to understand the reference or is it too "Western Culture"?

Also fixed some punctuation and repetitive words. (period instead of a question mark and removed a repetitive "to to")

Note by Allan Ransom on 02/10/2011 03:14 PM

Good point about the car example. I'll try and rework that example.

Thanks also for adding the additional content about using info at hand to rule out a possible cause. Good stuff.

Note by Tom Sajbel on 02/02/2011 12:18 PM

Ransom:

The car example given for the identification doesn't really have a non-disruptive test if you have narrowed the scope of the problem to the starter. It's pretty much replace the starter.

I also added some info to the possible causes section regarding whether anything can be ruled out based on the info at hand.

Note by Allan Ransom on 02/01/2011 06:11 PM

Test possible causes

Herbst: I'm not sure about this example...just wanted to make a note of this.

Note by Tim Herbst on 01/27/2011 03:23 PM