Performance Issues

Overview

Performance-affecting issues are perhaps the most rewarding aspect of software problem resolution.  It is an area requiring a degree of 'artistic' experience.  It is truly a joy to watch such an artisan at work.

The scene:

A web-based financial application, nearing the end of a long development effort by a team of almost 700 developers.  As is typical of such massive projects, it was the 11th hour and many components were simply not meshing as expected.  Our involvement was arranged as advanced 'training' in the best uses of certain diagnostic tools that were in use and were confusing team management.  They were simply trying to stop the finger pointing among their own teams.  Or so we were told...

Arriving on site for a three day training exercise, the instructor was taken to a coffee shop for a briefing. Here the true state of affairs was relayed:

1. A demo to their client was scheduled in four days.  The demo could not be postponed again.

2. If any of the current 'issues' were still outstanding, the project would be terminated.

3. Would the instructor shift from teacher to consultant and assist with clearing the current issue list?

GULP!  How many issues are on the list?  1200 was the sheepish reply.

The instructor said: "You have me here, scheduled for three days.  We will use this first day and I will attempt to cover TWO days of training in only one.  We will then divide the classroom into groups and attack the issues in the war room approach.  That one day of training is required, for an understanding of the terminology and hopefully a respect for the methodologies we will employ.  For I will be directing multiple teams at the same time, and the ability to communicate will be vital."

They begged not to spend the first day being lectured to, and then relented as the instructor insisted it was necessary.

 

Structure

Imagine three days to resolve 1200 issues involving nearly 700 people.  The communication issues alone would take weeks.  What was the trick the instructor had up his sleeve?

The knowledge that ALL software issues fall into one of four scenarios.  He simply asked how many of the problems were performance related.  Answer: 968.  How many were behavior (it is not performing as expected)?  3.  Then that means the rest are of the 'works once in a while' or 'works on this machine but not that machine' kind... they agreed.  The instructor also had more faith in the METHODS than in the tools or in these unknown developers.

On the morning of the second day, all were anxious to get started and the instructor divided the management members into FOUR teams:  performance, behavior, comparison (works there but not everywhere, every time) and a fourth SWAT team to be assigned tasks as needed.

He charged the behavior and comparison teams to commence gathering test results as described the day before.  This test data was required from multiple machines for each failing component and would take some time to gather.  If any questions surfaced, they were free to interrupt the instructor.  The instructor then focused on the performance team.

This case report will focus on the results of the performance analysis, as they are the most dramatic.  We will share at this point that by 5 PM that day there was only ONE issue remaining on the issue list... Yes, the team wished to press through that one, and the instructor suggested they do so themselves, as he would be happy to review their work on the third and final day.  And if they were unsuccessful, that would leave plenty of time for its resolution using the instructor's advice.

The Stated Problem

Performance Issues.  In this case there was a stated contractual threshold of 3 seconds maximum for ANY page presented on the client computer.  Recall that this is a web-based solution, which uses web browsers at the client, a web server, and behind that a database server.  It is often difficult to predict exactly what this maximum client time will be until AFTER the application is developed and running.  However, in this case it was mandated that no page would take longer than 3 seconds.

It is interesting to note that a majority of the pages listed under this issue were marked as SOMETIMES.  Research had even been done to show that, most of the time, an errant page would draw in well under the 3 second mark.  BUT even one occurrence at 3 seconds or more would cause project termination.  At times the worst case could be 7 or 8 seconds... followed by 1 or 2 seconds for the same page.  They were surprised when the instructor advised that this is perfectly NORMAL behavior when dealing with web servers and databases, due to their caching methods.
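
To make the contractual point concrete, here is a minimal sketch in Python of why a page can look healthy 'on a majority basis' and still fail the contract.  The sample timings and the function name are hypothetical, invented for illustration; only the 3 second threshold comes from the engagement.

```python
from statistics import median

THRESHOLD_SECONDS = 3.0  # contractual maximum from the case

# Hypothetical timings (seconds) for repeated loads of one page.
samples = [1.4, 7.2, 1.1, 1.8, 8.0, 1.3]


def violates_contract(timings, threshold=THRESHOLD_SECONDS):
    """True if ANY single load reached the threshold; the typical load is irrelevant."""
    return any(t >= threshold for t in timings)


typical = median(samples)             # what 'most of the time' looks like
breached = violates_contract(samples) # what the contract actually tests
```

With these made-up numbers the typical load is comfortably under the limit, yet the page still breaches the contract, which is why the worst case is the figure to chase.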

The Root Cause

So the exercise was to CATCH test data on one of the long running requests.  Actually a trivial matter with the tools being used.  From here a classic shift commences from a performance analysis method to one of software behavior.  This is not uncommon, and we simply point it out in advance so as not to lose the reader midstream.

From the report of ALL pages processed on the web server, and drilling into a single page which required more than three seconds (6.34 in this case), we find that the processing of the page contains 15 'exceptions' thrown.  For the uninformed:  exception handling is very time consuming for computers.

To confirm an inference, the instructor then shifted from a by-page report to a by-exception report... and there on the report was every page in the test data that took more than three seconds.  Additionally, the instructor confirmed that every one of the exceptions was being produced by exactly the SAME component.

An immediate shift from performance metrics to code module behavior occurred.  The instructor drilled into the details of the module.  It is important to note here that internal project security of intellectual property precluded direct access to source code.  The instructor would not be granted access to any source code for any components.  A small matter, for the instructor pointed to the details of the module and declared it would be most efficient if the coder (whose name was displayed on the screen) could join us for a chat.  After a few chuckles around the room the manager recovered and scurried off to fetch his charge.

He returned moments later with a young and very nervous developer who was carrying his notebook computer more as a shield than a computer.
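
The shift of viewpoint described above can be sketched in a few lines of Python.  This is only an illustration under stated assumptions:  the record layout, the page names and the component name 'LoginCheck' are hypothetical, and the diagnostic tool used on site is not reproduced here.  Only the 6.34 second page and its 15 exceptions come from the case.

```python
from collections import defaultdict

THRESHOLD_SECONDS = 3.0

# Hypothetical request records: (page, seconds, components that threw an exception).
requests = [
    ("/home", 1.2, []),
    ("/accounts", 6.34, ["LoginCheck"] * 15),
    ("/home", 0.9, []),
    ("/transfer", 4.1, ["LoginCheck"] * 15),
]


def slow_pages(records, threshold=THRESHOLD_SECONDS):
    """By-page view: every request that reached the threshold."""
    return [(page, secs) for page, secs, _ in records if secs >= threshold]


def by_exception(records):
    """By-exception view: each throwing component mapped to the requests it touched."""
    report = defaultdict(list)
    for page, secs, thrown in records:
        for component in sorted(set(thrown)):
            report[component].append((page, secs))
    return dict(report)
```

If the two views list the same requests, and a single component accounts for all of them, the performance question has become a behavior question about that one component.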

The Results

A little background is necessary here so all can comprehend the events that occurred.  First, the offending module is designed to LOGIN a user.  You know, that piece that takes a username and password and checks to ensure they are allowed to 'do this'.  Due to the nature of web-based applications, this function has to be repeated on EVERY request to the web server.  And in this case, EVERY TIME there was an issue with this login function the page required excessive time to process.  The instructor had already confirmed in his own mind that fixing this one piece of code... the root cause... would in fact clear 968 issues from the issue list.  He suspected the classroom was also already aware of this fact.

Without source code to examine personally, the instructor's only recourse was to 'dictate' to this developer exactly WHAT was wrong with his code.  A delicate situation, to say the least.  So the instructor first advised the young coder of how good his code was and that he was NOT in jeopardy.  Just relax and this will take only a few minutes.  Like the patient in a dentist's chair, the coder seemed wary... the management team was reserving its sentence pending success.  This of course fell squarely on the young coder.

The instructor simply declared point blank: "On line X of source code file Y you are using the IsNumeric function to verify that the login password is all digits.  Correct?"  It took a minute for the developer to reach the mentioned line of code.  He replied, more a question than a statement, "Yes?"  "And that function call is inside a loop that repeats for all 15 possible characters of the password.  Correct?"  Again, yes.  The instructor pressed on: "We would appreciate it if you would do TWO things in that block of code.  First, test to see if you have ANY password characters before entering the loop, and second, discontinue use of the built-in IsNumeric function and write your own.  One that works by true/false response rather than by exception."

He was bewildered and confused.  It took almost 30 minutes for the management team to catch him up on exactly what had transpired and why the request was being made.  We all did our best to put him at ease:  his efforts were exactly correct and in keeping with coding best practices; he had done his job and done it well.  But in this case, with this software, a minor bending of the best practice rules was required.  He agreed to try and scurried off.  He returned only seconds later with a question... 'How does one rewrite the IsNumeric function?'  Everyone laughed, knowing he was joking and was simply confirming what we all felt:  the pressure was suddenly lifted... the real root cause had been identified, and its fix was a few lines of code and only minutes away.  Our team had, in just over one hour, removed the majority of the Performance Issues in the system.

Later tests would show that not only did all pages draw in less than three seconds... BUT EVERY page was now drawing in 25% less time than before.  All from ONE line of code.
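
For readers who share the young coder's question, here is a hedged sketch of the two requests the instructor made.  The project's source was never shown to us, so this is Python standing in for the original language, every function name is hypothetical, and int() is used only as a stand-in for a check that reports failure by raising an exception.

```python
def is_digit_by_exception(ch):
    """Exception-driven check: the failure path costs a raise and a catch."""
    try:
        int(ch)
        return True
    except ValueError:
        return False


def is_digit(ch):
    """True/false check: the failure path is an ordinary comparison."""
    return len(ch) == 1 and ch in "0123456789"


def password_is_all_digits(password):
    """Apply both requests: test for ANY characters first, then loop without exceptions."""
    if not password:  # nothing to check, so never enter the loop
        return False
    return all(is_digit(ch) for ch in password)
```

Both checks give the same answer for ordinary input such as '7', 'a' or an empty string; the difference is that the second never raises, so a missing or non-numeric password costs a comparison rather than an exception per character.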

 

Summary

Performance improvement is an interesting challenge at any level.  No one should promise or expect any particular percentage of improvement.  You get what you get when tweaking performance.

Our experience is that performance improvement almost always turns into an errant code situation.  There are FEW cases where one simply says 'no errors, so just change from this method to that for faster performance'.  Most performance related issues stem from errant code, not inefficient code.

Also of importance for this topic is not to rely on assumptions about where performance is being lost.  Countless hours are spent improving code that in fact contributes only a minor portion of the time.

One MUST measure to be certain exactly WHICH code is taking the MOST TIME.  Many coders will 'feel' they know which block of code is the inefficient one.  They in turn spend hours and hours, with a resulting impact too small to measure.
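
As one way of measuring rather than guessing, the sketch below uses Python's standard cProfile and pstats modules.  The three workload functions are hypothetical stand-ins, deliberately built so that the block a coder might 'feel' is slow is not the one that dominates; this is not the tool used in the engagement.

```python
import cProfile
import io
import pstats


def suspected_hotspot():
    """The block a coder might 'feel' is slow (hypothetical, actually cheap)."""
    return sum(range(1_000))


def actual_hotspot():
    """The block that really dominates the run (hypothetical)."""
    return sum(i * i for i in range(200_000))


def handle_request():
    suspected_hotspot()
    actual_hotspot()


def profile_top(func, limit=5):
    """Run func under the profiler and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(func)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(limit)
    return out.getvalue()


report = profile_top(handle_request)
```

Printing report shows where the time actually went, ordered by cumulative time, before anyone spends hours on the wrong block.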

Refer to the 'Minor Issue' for a look at how a solid application appeared to work, when in fact it was throwing countless exceptions.  When the errant code was corrected, performance was stellar.

Further advice:  These people were lucky with the results of such a 'hidden agenda' for the engagement.  It is always wise to be candid and honest when engaging technicians for such demanding work.  For without a true software problem expert, the outcome of this engagement could have been far different.