Often the first approach to optimizing website performance is to add more hardware. This works as long as hardware is the limiting factor, but it may have little or no effect if the real bottleneck is the database behind the site. Faced with rapid user growth, a shortage of skilled technical staff, and the careful scrutiny of venture capital investors, many dot.com startups are struggling to find effective ways to optimize the performance of their websites. This is a growing concern among IT managers, especially those who oversee high-performance, high-availability, rapidly growing web-based systems. The ultimate goal of a system manager is to stay ahead of the growth curve while providing optimal service. Many issues can be resolved by a competent internal IT staff. However, knowing when to bring in outside technical experts and how to work effectively with them can mean the difference between a quick, efficient solution and unnecessary expenditure or even system failure.
This paper details the Rare & Dear Performance Team’s approach to diagnosing performance bottlenecks, including the strategies and tactics it employs to resolve performance-limiting issues quickly and effectively. The methodology is neither startling nor proprietary; rather, it is a well-planned, focused strategy that the team developed for intervening in performance-critical situations. Every IT professional should know a process for identifying problems, know how to pair short-term tactical solutions with longer-term system strategies, and understand the critical success factors for website performance remediation.
Key Success Factors
The following points describe the key factors that lead to a successful outcome.
- In-depth knowledge of system: It is important that the Performance team quickly gain a deep understanding of the entire system that is having problems. This can be achieved through system monitoring tools, written documentation of system configuration, written records of system change activity, or oral recollections by Client staff members.
- Team approach: It is important that the Performance team take a broad look at the entire system and not lose time in the details of any one sub-system. Several individuals with complementary skills and backgrounds working as a team provide the most efficient means of achieving this goal. A methodology that forces individuals to review and critique their own assumptions in light of all the information about the system allows the team to stay on track.
- Top management support: The Client must commit full and unequivocal top management support to the Performance team. Management must communicate this to the staff consistently and frequently so that access to systems and reports, answers to questions, and identification of other relevant resources are readily obtainable by the Performance team. Even when confronted with competing urgent problems and projects, the staff must respond promptly to Performance team requests. Without this commitment, the process will take much too long to be tolerable in the dot.com world.
- Dedicated command center: If possible, the Performance team should work off-site, isolated from the mainstream of daily operations. If that is not possible, the Performance team should be housed in a dedicated operations room. Here, they will set up network and system access, phone lines, email, and a meeting place for the team. It is a “do not disturb” zone that insulates them from being bombarded by worried developers and supervisors wanting the latest status report.
- Detailed activities log: The Performance team maintains a detailed activity log of all meetings, data, performance issues, system development activities, communications, and contacts so that as system modifications are made, they can be correlated with other events. The log becomes part of the system documentation.
- Internal and External Performance Teams: Most companies will have their own internal performance team that has worked on performance issues before the external Performance team arrives. The external Performance team works closely with the internal team but brings a fresh, unbiased perspective.
- No Assumptions: The Performance team arrives without preconceptions or attachment to the existing systems. They question every statement and explanation presented by the staff. Their holistic approach emphasizes the importance of the entire system and the interdependence of its parts.
- Road to Recovery: The Performance team will provide the Client with immediate relief through temporary workarounds. However, these short-term actions may not be appropriate as a long-term solution. The Performance team must also provide a prescription for system modifications that can deliver a long-term improvement.
- Definition of Success: Criteria for successfully solving performance issues are difficult to quantify in advance. A working definition of success may be to “provide system availability during all levels of current site traffic and allow the system to function well into the future.” This vague, but recognizable, definition can be workable when there is mutual trust between the Client and the Performance team. In other situations, a more measurable definition may need to be agreed upon, such as: “>99% availability in a day” with “average user response time < 2 sec” for “two consecutive days”.
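A measurable definition like the one quoted above can be checked mechanically once availability and response-time figures are collected per day. The following Python sketch is illustrative only: the `DailyStats` record, its field names, and the function names are assumptions made for this example and are not part of any Rare & Dear tool. The thresholds are the ones quoted in the definition.

```python
from dataclasses import dataclass


@dataclass
class DailyStats:
    """One day of site measurements (hypothetical record layout)."""
    availability_pct: float   # percentage of the day the site was available
    avg_response_sec: float   # average user response time, in seconds


def day_meets_target(day, min_availability_pct=99.0, max_response_sec=2.0):
    """True if the day has >99% availability and average response < 2 sec."""
    return (day.availability_pct > min_availability_pct
            and day.avg_response_sec < max_response_sec)


def success_reached(days, required_consecutive_days=2):
    """True if any run of consecutive days of the required length meets the target."""
    streak = 0
    for day in days:
        # A failing day resets the run; a passing day extends it.
        streak = streak + 1 if day_meets_target(day) else 0
        if streak >= required_consecutive_days:
            return True
    return False
```

Passing the list of daily records in chronological order to `success_reached` answers whether the agreed criterion has been met; the number of consecutive days and both thresholds are parameters so that a different agreement with the Client can be expressed without changing the logic.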
Typical Situation and Tactics
This section describes a typical engagement by the Rare & Dear Performance team.
As the first step in getting to know the Client system, Rare & Dear provides the Client with a Questionnaire asking for detailed information about technical aspects of the web system and its performance characteristics. As the second step, the Performance team installs a proprietary Remote Diagnostic package on the Client system to gather preliminary system and database statistics. The data from a 24-hour period is analyzed in light of the information provided by the Client. Both static and dynamic measurements are taken to highlight OS and database activity as well as website response time (from the users’ perspective). These measurements are charted, graphed, compared with each other, and compared against “reasonable” values. The result is a report that indicates the most likely areas to investigate more deeply for solutions to performance issues.
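The comparison of collected measurements against “reasonable” values can be pictured with a small Python sketch. It is not the proprietary Remote Diagnostic package, whose internals are not described here. The metric names and the CPU and disk limits are hypothetical placeholders chosen for illustration; only the 2-second response figure echoes the example given under Definition of Success.

```python
# Hypothetical upper limits. Real "reasonable" values depend on the system
# under study and would be set by the people analyzing it.
REASONABLE_LIMITS = {
    "cpu_busy_pct": 80.0,
    "disk_wait_ms": 20.0,
    "page_response_sec": 2.0,
}


def flag_out_of_range(samples, limits=REASONABLE_LIMITS):
    """Return {metric: fraction of samples above its limit}, worst first.

    `samples` maps a metric name to the list of values collected over the
    measurement period. Metrics with no configured limit, no samples, or
    no samples above the limit are left out of the result.
    """
    report = {}
    for metric, values in samples.items():
        limit = limits.get(metric)
        if limit is None or not values:
            continue
        over = sum(1 for value in values if value > limit)
        if over:
            report[metric] = over / len(values)
    # Sort so the metric that exceeds its limit most often comes first.
    return dict(sorted(report.items(), key=lambda item: item[1], reverse=True))
```

A metric that spends a large fraction of the period above its limit is a candidate for deeper investigation, which matches the role of the report described above. It is a starting point for the team's questions, not a diagnosis.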
After the Remote Diagnostics are run, a decision is made as to whether the remediation work can be accomplished remotely or whether it requires the Performance team to work on-site. If the team must work on-site, a command center is set up as described above. The team is then briefed by top management and introduced to key personnel who will facilitate access to systems and information. For remote projects this is accomplished via teleconference or even video-teleconference.
As is typical of dot.com projects, this activity needs to be defined and implemented quickly, with maximum communication but minimum impact on the Client’s existing systems and staff. To get a quick handle on current system performance with an eye to identifying bottlenecks, the Performance team takes a broad look at a number of different factors, with one sub-team assigned to investigate at the macro level (top down) while a second sub-team investigates at the micro level (bottom up). The entire Performance team meets every two hours to share findings and fine-tune the investigation.
Any existing tools for gathering system performance data are reviewed. A Client may have a wide variety of tools in place and still lack staff with the skills to identify a complex problem.
The Performance team interviews key staff to gain knowledge of historical site performance and growth patterns, recent system changes, and planned system changes and additions.
The Performance team initially surveys the entire system for performance and capacity metrics. The depth and breadth of the Performance team’s knowledge enables them to focus quickly on critical issues that eventually lead to optimal solutions. Their fresh, detached perspective helps the Client’s staff question all assumptions and hypotheses, whether formed by internal staff or by Performance team members.
A number of questions are asked of each application that is
having performance problems, for example:
- How is the application using and accessing the data?
- How is the database parsing the SQL statements?
- Are sufficient operating system resources allocated and configured to support Oracle?
- Are sufficient hardware resources available and configured to support Oracle?
After synchronizing performance graphs from the various sources, a picture often emerges showing a high correlation between certain database activity and site performance degradation. A theory about the cause of the bottleneck can then be formed. Further examination of the apparent bottleneck will either confirm or refute the theory. One or more rounds of hypothesis and examination may be required to clearly identify the problem(s). Often more than one problem contributes to poor site performance.
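The step from synchronized graphs to a first hypothesis can be sketched in Python. The example assumes the series have already been aligned to the same sampling times, uses the standard Pearson correlation coefficient as one possible measure, and ranks database metrics by how strongly they move with user response time. The function names and the metric names in the usage note are hypothetical, and this is not a description of the team's actual tooling.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation information
    return cov / sqrt(var_x * var_y)


def rank_suspects(response_times, db_metrics):
    """Rank database metrics by strength of correlation with response time.

    `db_metrics` maps a metric name to a series sampled at the same times
    as `response_times`. Returns (name, coefficient) pairs, strongest first.
    """
    scores = {name: pearson(series, response_times)
              for name, series in db_metrics.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)
```

For instance, calling `rank_suspects(response_times, {"hard_parses": [...], "logins": [...]})` would list first whichever series tracks response time most closely. A high coefficient only points to where to look; as the paragraph above notes, further examination is still needed to confirm or refute the theory.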
Within a day or so, the Performance team will usually have isolated the problem and provided a temporary workaround to achieve acceptable performance (if the live site is experiencing unacceptable performance). After the Client has implemented the workaround, the Performance team continues to observe system behavior over the next 24 hours. If performance is acceptable, they will develop both short-term and more robust long-term recommendations to make the improved performance permanent.
After a debrief to Client management, the Performance team will end the intense site scrutiny. However, at the Client’s request, they may continue to monitor performance for a longer period to ensure that no other problems arise. A full report of the team’s findings and recommendations is also provided to the Client. Recommendations may cover hardware, software, development methodology, operations, and even management approach. The Performance team is then in a position to provide direction and oversight to Client staff in implementing any of the recommendations.
Conclusion
If you recognize the need for outside intervention, you must provide full support down the chain of command to facilitate quick turnaround and a good return on your investment. By engaging an external team of database and system performance experts, you can often combine internal knowledge with experienced, complementary outside expertise to implement a performance improvement plan that can save hundreds of thousands of dollars in avoided hardware upgrades.
© Rare & Dear, Inc. 2002