Finger Pointing

Overview

We admit this one is very embarrassing for all concerned, but IT IS REAL! Because it occurred within governmental agencies, we will be deliberately vague about certain facts. Since there are TWO governmental bodies involved, let's call them agency A and agency B. Agency A can be viewed as a law enforcement/security body. The services of agency B are highly computer centric.

We feel this case is an important one to present, as it speaks not only to having the proper tools for the job but, even more, to having a well thought out 'process' or method of using those tools.

Structure

Since both agencies require a regional presence, they share physical space, and agency A is given computer support by agency B.

The Stated Problem

A single computer (a notebook) in a distant outpost of agency A suddenly displays a shortcut to an unauthorized web site on its desktop. A call to agency B's support team results in the simple instruction to delete the desktop shortcut. It of course reappears.

We all know this as an 'ad-ware' type of virus/Trojan horse software problem. 

Believe it or not, this cycle continues for MONTHS. Of course this starts to cast shadows upon agency B's capabilities. Within such shadows, the real purpose and the sensitivity of the data for which agency B is responsible come under question.

The next action is not surprising. Due to the pressure and the fact that the offending computer is a notebook... it is shipped to the support center. The machine passes through numerous technicians and undergoes an exhaustive TWO month search for the root cause. We applaud the tenacity of the support team, for most would have simply reformatted and re-imaged the system. However, in this case it was mandatory that the root cause be clearly identified.

The Root Cause

It is important to note here that the root cause of this problem HAD been resolved before our involvement, AND that it had been resolved using the exact same tool we would have used.

We became involved only after the fact. This case became so notorious within both agencies that, of course, the methods of resolution were touted. It was decided that all members of the support team were to be trained on the use of this tool. Therefore our engagement was one of pure training.

The root cause was indeed a nasty little piece of code - hidden in plain sight somewhere on the hard drive.  So methods needed to be employed to locate the EXACT code component responsible and from there determine how best to remove the rascal. 

So after months of trial and error, the single staff person with an awareness of the techniques of this type of research began to look INSIDE this computer as it did its thing.  This senior staff person is the first to admit that only by pure luck after two days of gazing at these details was his attention drawn to 'something curious'.  It was also stated that this 'attention getter' happened just as he was giving up and about to re-image the computer. 

The Results

Of course, during the training session it was suggested that the notorious case be used as one of the hands-on lab exercises. The test results were still available and were offered to the instructor for review before the class. The instructor refused to accept the test data, and further requested NO DETAILS about the problem or its solution. One would assume the instructor needed to review the test data in preparation for classroom use, BUT the instructor wished to impress his students with the process and not the tool used. It was his hope to show how the application of a well-defined method of analysis would reveal the root cause with efficiency and accuracy.

So at the appointed place during training, all were eager to have this technician present the 'problem' and its 'data' to the instructor. The instructor asked anyone in the room who knew, or even thought they knew, the root cause to please remain silent and simply observe the method of analysis. They were asked to note how long, in time and mouse clicks, it took to arrive at the root cause. With the screen projected on the classroom wall, the instructor began drilling into the data, giving the class a verbal account of the process.

There was an interruption as the third report flashed on the screen. It was the original technician expressing amazed disbelief, for there clearly on the screen was the root cause as found by him and, in his mind, found by pure luck. It had taken the instructor only 6 mouse clicks and a couple of minutes to announce the exact culprit. It took the rest of the day to explain to the class exactly how the method/process led the instructor to that report. For even after watching it in real time... it seemed to them like magic.

In fact, the prepared classroom agenda was abandoned as 6 current issues were presented to the instructor as further tests of the methodology. Every issue was resolved in that classroom, not by the instructor but by the students. The instructor was correct to refuse to analyze any further issues himself. He simply challenged the students to apply the methods themselves, repeating the methodology as each issue was presented and solved.

Summary 

This is a classic example of the advantages of employing a structured approach to troubleshooting. In the case of software related problems, the complexity of hardware options, coupled with software versions and the exact path of execution, presents far too many variables to leave accurate analysis to chance or assumption.

The only path to repeatable success is through the application of proven methodologies. 

There are actually only FOUR scenarios when approaching software related issues.  When the problem is articulated its nature will dictate which of these analysis approaches will be used - initially.

1. Works on one computer but does not work on another.

2. Performance / speed of execution requires improvement.

3. Unexpected behavior or inaccurate results.

4. Inspection to ensure code is error free.

When hearing an issue, an experienced troubleshooter will categorize it into one of these groupings. For example, it is a complete waste of time to look for performance-affecting metrics on a piece of software that is not functioning correctly.
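To make the triage step concrete, here is a minimal sketch in Python of sorting a reported issue into one of the four scenarios. The trigger phrases and the names Scenario, TRIGGERS and categorize are our own assumptions for illustration only; a real troubleshooter categorizes by judgment, not by keyword. The one deliberate design choice is the ordering: incorrect behavior is checked before performance, since there is no point measuring the speed of software that is not functioning correctly.

```python
from enum import Enum
from typing import Optional


class Scenario(Enum):
    WORKS_HERE_NOT_THERE = 1   # works on one computer, not on another
    PERFORMANCE = 2            # speed of execution requires improvement
    UNEXPECTED_BEHAVIOR = 3    # unexpected behavior or inaccurate results
    INSPECTION = 4             # inspection to ensure code is error free


# Hypothetical trigger phrases, checked in order; the first match wins.
# Behavior is listed before performance on purpose (see the text above).
TRIGGERS = [
    (Scenario.WORKS_HERE_NOT_THERE, ("other machine", "another computer")),
    (Scenario.UNEXPECTED_BEHAVIOR, ("wrong", "crash", "reappears", "unexpected")),
    (Scenario.PERFORMANCE, ("slow", "takes too long")),
    (Scenario.INSPECTION, ("review", "audit", "before release")),
]


def categorize(report: str) -> Optional[Scenario]:
    """Return the first scenario whose trigger phrase appears in the report."""
    text = report.lower()
    for scenario, phrases in TRIGGERS:
        if any(phrase in text for phrase in phrases):
            return scenario
    return None  # not enough information yet; ask more questions
```

With these assumed phrases, a report such as "Report is slow and gives wrong totals" is categorized as UNEXPECTED_BEHAVIOR rather than PERFORMANCE, which is the point being made above.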

Of course, many issues are reported from the wrong perspective. For example, a performance related analysis typically turns into an unexpected behavior exercise. It is unexpected behavior that consumes resources and time. See the 11th Hour case for an example of this. In these cases the performance metrics only identify WHAT is taking time... not why. One must shift methodologies from performance to behavior to complete the analysis.
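The shift from performance to behavior can be sketched in a few lines of Python. Everything here is assumed for illustration: the trace records, the function names in them and the cost units are invented, and no particular tool produces data in this shape. The first function answers WHAT is consuming the time; the second looks at what those calls actually did, which is where the why comes from.

```python
from collections import Counter

# Hypothetical trace: (function name, cost units, outcome) for each call.
TRACE = [
    ("parse_input", 2, "ok"),
    ("open_connection", 30, "timeout"),
    ("open_connection", 30, "timeout"),
    ("open_connection", 30, "timeout"),
    ("open_connection", 5, "ok"),
    ("write_output", 3, "ok"),
]


def slowest_function(trace):
    """Performance view: which function accounts for the most total cost."""
    totals = Counter()
    for name, cost, _outcome in trace:
        totals[name] += cost
    return totals.most_common(1)[0]  # (name, total cost)


def outcomes_for(trace, name):
    """Behavior view: count what each call to the named function did."""
    return Counter(outcome for fn, _cost, outcome in trace if fn == name)
```

In this invented trace the performance view points at open_connection, and the behavior view then shows three timeouts before one success. The metric alone would suggest "make the connection faster"; the behavior shows the real question is why it times out.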

The point is that the troubleshooter must be methodical in confirming facts. Without the ability to see inside the computer to confirm the facts at that instant on that computer... only assumption can be used for analysis. So the tools, as necessary as they are to confirm facts, are only as good as the methodology employed to draw inferences from those facts. With an inference in hand, one must then identify additional facts to confirm the inference. Only then will the true root cause be located.
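The fact, inference, confirmation loop described above can be written down as a small Python sketch. The fact labels and inferences in the usage note are invented for illustration and are NOT the findings of the case described here; the function names are likewise our own. The idea is only that each inference lists the facts that would confirm it, and an inference is not accepted until none of those facts is still missing.

```python
def facts_still_needed(observed, inferences):
    """For each inference, list the confirming facts not yet observed.

    observed:   set of fact labels confirmed on the machine
    inferences: dict mapping an inference to the set of facts that confirm it
    """
    return {
        name: sorted(required - observed)
        for name, required in inferences.items()
    }


def confirmed(observed, inferences):
    """Return only the inferences with no confirming facts left to gather."""
    pending = facts_still_needed(observed, inferences)
    return [name for name, missing in pending.items() if not missing]
```

As a hypothetical usage: if the only observed fact is "shortcut reappears after reboot", and the inference "a startup entry recreates the shortcut" also requires the fact "startup entry writes to the desktop", then facts_still_needed reports that second fact as outstanding and confirmed returns an empty list. The troubleshooter's next step is to go and observe that fact, not to act on the inference.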