Monday, March 5, 2012

Finding the Causes of Poor Performance in SQL Server, Part 1


Part 1: Rounding up the likely suspects.

To tackle performance problems with applications, you first find the queries that constitute a typical workload, using SQL Profiler: Then, from the trace, you find the queries or stored procedures that are having the most impact. After that, it is down to examining the execution plans and query statistics to identify queries that need tuning and indexes that need creating. You then see what effects you've had and maybe repeat the process. Gail explains all, in a two-part article.
Troubleshooting a SQL Server performance problem is often a frustrating exercise as, without some experience in tuning, it's hard to know where to start and what to do. In this article, part 1 of a 2-part series, I'm going to walk through a hypothetical scenario that takes a badly-performing database and demonstrates how to find the problematic procedures, and the problematic queries within them, using SQL Profiler.
In part 2, I'll examine the execution plans and query statistics for the poorly-performing queries identified by Profiler, and propose some possible fixes for these problems.

The Problem

I'm the DBA responsible for maintaining an internet forum. Initially the performance of the site was great but, as the volume of data has grown and the load on the server increased, the queries have grown slower and deadlocks have become a frequent occurrence.
After discussion with the developers, I've identified indexing as a likely issue, because it sounds like little time was given to index design or performance tests during development. In fact, a cursory glance at the forum database reveals that no indexes were created other than the primary keys!

The Plan

It's very likely that creating a few useful indexes will speed things up nicely. The problem is that creating useful indexes is impossible without knowing exactly what queries are running against the database, what tables and columns they're using and which ones are contributing the most to the performance problem.
In light of this, I've devised the following plan to identify the indexes, and any other changes, that are necessary to restore forum response times to the required level:
  1. Capture a profiler trace for a short period while the server is active. This will identify the queries that constitute a typical workload.
  2. Import the trace results into a database table and analyse them to find the queries or stored procedures that are having the most impact on the server.
  3. Run those stored procedures in Management Studio against a test database, and examine the execution plans and query statistics.
  4. Use the execution plan and statistics to identify queries that need tuning and indexes that need creating.
  5. Implement the changes and review the effects.

Obtaining Trace Information

The first thing that I need to do is to identify the procedures that have the highest duration, the highest CPU, or are performing the most IOs. It's a waste of time to try and optimise procedures that are either running well, or are never run
I'm not going to use the Profiler GUI to obtain the trace. Using the Profiler GUI is fine for a development server, but it can cause performance degradation on busy servers, even to the point of crashing the server. Hence I'll be using a server-side trace.
NOTE: For details of why the Profiler GUI can cause problems, see this post on Grant Fritchey's blog.
While I won't be using Profiler to run the actual trace, I will use it to create the trace definition, since generating this by hand can be error prone (not to mention tedious). Once I have the trace definition, I'll capture the trace file using server-side trace, which uses the sp_trace set of stored procedures. This is the same set of procedures that Profiler uses "under the covers" when you capture traces through the GUI.

Creating the Trace Script

A trace definition describes the trace that you would like to capture. In other words, it defines:
  • The SQL Server events that you would like to capture
  • The data columns that you would like to capture for each event
  • Any filters that you would like to create, in order to omit from the trace any occurrences of the event in which you are uninterested. For example, it is common to capture only those events that are generated by a particulate user, application or database.
In order to generate the trace definition, I must first create a trace within Profiler that defines all of the events, columns and filters that I want. For identifying poorly performing stored procedures, the most important events are:
  • RPC:Completed (in the stored procedures event category) This event is fired whenever a remote procedure call completes. An example of a Remote Procedure Call would be a stored procedure executed from a .NET application where the SQLCommand object's command type is StoredProcedure.
  • TSQL:BatchCompleted (in the T-SQL event category). This event is fired whenever an ad-hoc SQL batch completes executing. An example of an ad-hoc SQL batch would be a query run  from Management Studio, or a query executed from a .net application where the SQLCommand object's command type is Text.
The most important data columns are TextDataCPUReadsWrites and Duration. Other columns, such as theLoginNameApplicationName and HostName, may be useful for identifying where the query comes from, but they are not essential for identifying which queries are performing poorly.
Start up Profiler, create a new trace (call it "WebForums" or similar), and then select the events and data columns as shown below in Figure 1. At this time, I'm not going to use any filtering. I want to see all of the queries executed against the server because I don't know what is important and what is not at this stage. I'm also not setting up any form of grouping as grouping only appears within the Profiler GUI and has no effect on the script generated.
Figure 1: Creating the Trace Definition
In order to get the script definition, I need to start the trace. For this reason, I'm doing this with Profiler pointed at a dev server, not at the production box. I'll stop the trace immediately as any results are unimportant; the trace must just run before it can be scripted.
Once I've started and stopped the trace, I can generate the trace definition. The script option is available under the File menu, by selecting Export | Script Trace Definition, as shown in Figure 2:
Figure 2: Exporting the Trace Definition
Once the trace definition is saved to file, I can open it in Management Studio and make a couple of modifications to it, before I run the trace against my production server. The items that I need to change in the generated script are the output file, the file size and the end time.
The trace file should be saved on a fast drive that does not have data or log files on it. I'll often use a backup drive if there's nothing else available. The heavier the load on the server, the more important it is that the destination drive for the trace does not contain any database files. I also set the trace file name to contain a time stamp, so that the trace name doesn't have to be modified if I need to run it again.
SET @OutputFileName = 'F:\Traces\WebForums' +
    CONVERT(VARCHAR(20), GETDATE(),112) +
    REPLACE(CONVERT(VARCHAR(20), GETDATE(),108),':','')
Note that the output file does not need the .trc extension, as it will be automatically added.
I set the end time of the trace to be offset from the current time so that the trace doesn't have to be run at a specific time. Half an hour, or an hour, is usually a good time period to run a trace. Much longer and you will find that there is just too much data to analyze.
SET @EndTime = DATEADD(mi,30,getdate())
Finally, I modify the MaxFileSize to something reasonable, so that I'll have just one or perhaps two trace files will be created for the interval I'm tracing. If the size is set too small, meaning that many "rollover" files are created, it becomes tedious to work with them later
The complete trace script, once I've finished modifying it, looks like this:
DECLARE @rc INT
DECLARE @TraceID INT
DECLARE @MaxFileSize BIGINT
DECLARE @EndTime DATETIME
DECLARE @OutputFileName NVARCHAR(256)

SET @MaxFileSize = 100
SET @OutputFileName = 'F:\Traces\WebForums' +
    CONVERT(VARCHAR(20), GETDATE(),112) +
    REPLACE(CONVERT(VARCHAR(20), GETDATE(),108),':','')
SET @EndTime = DATEADD(mi,30,getdate())

exec @rc = sp_trace_create @TraceID output, 0, @OutputFileName, @MaxFileSize, @EndTime

-- Set the events and columns
declare @on bit
set @on = 1
exec sp_trace_setevent @TraceID, 10, 16, @on
exec sp_trace_setevent @TraceID, 10, 1,  @on
exec sp_trace_setevent @TraceID, 10, 17, @on
exec sp_trace_setevent @TraceID, 10, 18, @on
exec sp_trace_setevent @TraceID, 10, 12, @on
exec sp_trace_setevent @TraceID, 10, 13, @on
exec sp_trace_setevent @TraceID, 10, 14, @on
exec sp_trace_setevent @TraceID, 12, 16, @on
exec sp_trace_setevent @TraceID, 12, 1,  @on
exec sp_trace_setevent @TraceID, 12, 17, @on
exec sp_trace_setevent @TraceID, 12, 14, @on
exec sp_trace_setevent @TraceID, 12, 18, @on
exec sp_trace_setevent @TraceID, 12, 12, @on
exec sp_trace_setevent @TraceID, 12, 13, @on

-- Set the trace status to start
exec sp_trace_setstatus @TraceID, 1
All I need to do now is run that script when the server is busy and wait half an hour. After that, I have my trace file.

Interpreting the Trace

The next step is to analyse the trace file to determine which are the procedures and queries that are placing the heaviest load on the server. I'm going to do that by loading the trace file into a database on a dev server.
One option would be to load the trace file into Profiler and then use Profiler's "save to database table" option. However, I prefer to create a dedicated table, called TraceResults, and then use a built-in SQL function calledfn_trace_gettable, to read the trace and import it into the table:
CREATE TABLE TraceResults (
 TextData VARCHAR(4000),
 Duration INT,
 Reads INT,
 Writes INT,
 CPU INT,
 StartTime DATETIME,
 ProcedureName VARCHAR(100)
)
GO

INSERT INTO TraceResults
(TextData, Duration, Reads, Writes, CPU, StartTime)
SELECT TextData, Duration/1000, Reads, Writes, CPU, StartTime
FROM fn_trace_gettable('F:\Traces\WebForums20090215193108.trc',1)
Notice that I'm dividing the duration by 1000 because, in SQL 2005 and 2008, Profiler captures the duration in microseconds, and I have no interest in durations with that level of granularity.
The reason for using a script to import the trace is twofold. Firstly it's easier and quicker to run a single query than to start Profiler, load up the trace file and then save it to a database table. Secondly, if the loading is done with a script, then both the creation of the trace and the loading of it can be automated using SQL Agent jobs. So, for example, you may have a job on the main server to start the trace and then one on the dev server, which starts an hour later, to copy the trace file over and import it.
Once I've got the trace file into a table, I can query it. It's tempting to simply order by the duration, reads or CPU and just focus on the top few records. However that is not sufficient. Adopting this strategy here, I find that "most expensive" query is a search query that has reads and duration far in excess of any other query, as shown in Figure 3:
Figure 3: Finding the "most expensive" queries
However, further investigation shows that the search query only ran once in the entire sample period, as shown in Figure 4:
Figure 4: The query with the longest duration ran only once over the analysis period.
A query that runs only once a day and minutes to run is less of an overall problem in general than one that runs 10 times a second and takes a half second to run each time. The queries that I need to optimise most urgently are the ones that consume the highest cumulative amount of time, or CPU, or number of reads, over the period monitored.
If ad-hoc SQL is involved, calculating the cumulative time, reads or CPU can be extremely non-trivial, as the slightly different query structures, and different values that may be hard-coded within the query, can make it very difficult to identify queries of the same pattern. Fortunately, this application uses just stored procedures, and as such getting the cumulative execution characteristics is relatively easy.
NOTE:  Parsing and identifying matching ad-hoc SQL Statements is extremely complex and there's no good way to do it in T-SQL. For an app that uses a great deal of ad-hoc SQL, I would recommend a tool like ReTrace to do the import and analysis, as it has the capability to parameterise queries that it reads from a trace file.
The following query will extract the name of the procedure from the text of the statement, using the EXEC keyword to identify where the procedure name starts and a space to find where it ends. This will not work perfectly in all cases, for example when a procedure has a space in the name. As with ad-hoc SQL, there's no easy method that will work in all cases to identify the procedure name, but this does cover the majority of cases.
UPDATE TraceResults
   SET ProcedureName =
   LEFT(
      RIGHT(TextData, LEN(TextData) - CHARINDEX(' ',TextData,CHARINDEX('Exec',TextData))),
      CHARINDEX(' ', RIGHT(TextData, LEN(TextData) - CHARINDEX(' ',TextData,CHARINDEX('Exec',TextData))) + ' ')
   )
where TextData like '%exec%'
Now that I have the procedure name for each of the queries in my trace table, it's trivial to aggregate the execution characteristics of these stored procedure over the sample period, as shown in Figure 5:
Figure 5: Aggregating execution statistics by stored procedure
As you can see, the procedure that had the both the highest total duration and the highest total IO is "ViewThread", with a total duration summed over all executions of 1,815,846 milliseconds, a total number of IOs of 442,711,386 and an execution count of 2377.
It's these procedures that I need to tackle first, in my drive to improve the overall system performance. I've often found, based on my experience performance tuning assorted systems, that optimising the three most intensive procedures (by time, CPU, reads or a combination) will result in dramatically improved performance of the entire system. If overall performance does not improve sufficiently, then I'll look at the next three procedures, and so on.
I do not recommend ever trying to optimise everything in one go. Firstly, trying to optimise all stored procedures, in anything other than a tiny application, will take ages. Secondly, changes made because of one procedure may completely change the execution characteristics of a different procedure and invalidate any work done on that.

Summary

Now I know the worst offenders, I need to take those procedures apart and look at why they are performing so badly. To do that, I'm going to use query statistics and the execution plan from SQL Management Studio. Look out for Part 2!

Source:

No comments:

Post a Comment