Monday, January 9, 2012

Understanding Perfmon Counters while troubleshooting SQL Server Performance Issues


To troubleshoot overall Database system performance issue, analyzing performance counters is the best way to start.    By collecting performance counters during busy period for few days consistently and analyzing those data would give a better idea about overall system problems regarding Memory, CPU, and/or Disk I/O.   Please note, for troubleshooting a particular SQL problem such as a stored procedure or a piece of T-SQL, it is better to look at the query execution plan and SQL Trace data and identify the need of redesigning a query or table indexes.    
 Following are some key performance counters to use while assessing a performance issues on SQL Server. 

Memory:

Memory and Disk I/O complements each other.   Memory issues on the system could affect disk I/O and vice versa.   It is very critical to carefully observe the trend of performance counters data over a long period of time to identify the real problem.     
·         Memory\Available Mbytes
Threshold: A consistent value of less than 20 to 25 percent of installed RAM is an indication of insufficient memory.
Significance: This indicates the amount of physical memory available to processes running on the computer. Note that this counter displays the last observed value only. It is not an average.

·         Memory\Page Reads/sec
Threshold: Sustained values of more than five indicate a large number of page faults for read requests.
Significance: This counter indicates that the working set of your process is too large for the physical memory and that it is paging to disk. It shows the number of read operations, without regard to the number of pages retrieved in each operation. Higher values indicate a memory bottleneck.
If a low rate of page-read operations coincides with high values for Physical Disk\% Disk Timeand Physical Disk\Avg. Disk Queue Length, there could be a disk bottleneck. If an increase in queue length is not accompanied by a decrease in the pages-read rate, a memory shortage exists.

·         Memory\Page Input/Sec:
Threshold: The value should not exceed 15.  Higher the value poor the performance.
Significance:  Pages Input/sec is the rate at which pages are read from disk to resolve hard page faults. Hard page faults occur when a process refers to a page in virtual memory that is not in its working set or elsewhere in physical memory, and must be retrieved from disk.

·         Memory\Pages/sec
Threshold: Sustained values higher than five indicate a bottleneck.
Significance: This counter indicates the rate at which pages are read from or written to disk to resolve hard page faults. To determine the impact of excessive paging on disk activity, multiply the values of the Physical Disk\ Avg. Disk sec/Transfer and Memory\ Pages/sec counters.  If the product of these counters exceeds 0.1, paging is taking more than 10 percent of disk access time, which indicates that you need more RAM.  If this occurs over a long period, you probably need more memory.
A high value of Pages/sec indicates that your application does not have sufficient memory. The average of Pages Input/sec divided by average of Page Reads/sec gives the number of pages per disk read. This value should not generally exceed five pages per second. A value greater than five pages per second indicates that the system is spending too much time paging and requires more memory (assuming that the application has been optimized). 

Processor

To measure processor utilization and context switching, you can use the following counters:  
·         Processor\% Processor Time
Threshold: The general figure for the threshold limit for processors is 65 percent.
Significance: This counter is the primary indicator of processor activity. High values many not necessarily be bad. However, if the other processor-related counters are increasing linearly such as % Privileged Time or Processor Queue Length, high CPU utilization may be worth investigating.

·         Processor\% Privileged Time
Threshold: A figure that is consistently over 75 percent indicates a bottleneck.
Significance: This counter indicates the percentage of time a thread runs in privileged mode. When your application calls operating system functions (for example to perform file or network I/O or to allocate memory), these operating system functions are executed in privileged mode.

·         System\Context Switches/sec
Threshold: As a general rule, context switching rates of less than 5,000 per second per processor are not worth worrying about. If context switching rates exceed 15,000 per second per processor, then there is a constraint.
Significance: Context switching happens when a higher priority thread preempts a lower priority thread that is currently running or when a high priority thread blocks. High levels of context switching can occur when many threads share the same priority level. This often indicates that there are too many threads competing for the processors on the system. If you do not see much processor utilization and you see very low levels of context switching, it could indicate that threads are blocked.  

Disk I/O

To measure disk I/O activity, you can use the following counters:  
·         PhysicalDisk\Avg. Disk Queue Length
Threshold: Should not be higher than the number of spindles plus two.
Significance: This counter indicates the average number of both read and writes requests that were queued for the selected disk during the sample interval.

·         PhysicalDisk\Avg. Disk Read Queue Length
Threshold: Should be less than two.
Significance: This counter indicates the average number of read requests that were queued for the selected disk during the sample interval.

·         PhysicalDisk\Avg. Disk Write Queue Length
Threshold: Should be less than two.
Significance: This counter indicates the average number of write requests that were queued for the selected disk during the sample interval.

·         PhysicalDisk\Avg. Disk sec/Read
Threshold: No specific value.
      Less than 10 ms – very good
      Between 10-20 ms – okay
      Between 20-50 ms – slow, needs attention
      Greater than 50 ms – Serious I/O bottleneck

Significance: This counter indicates the average time, in seconds, of a read of data from the disk.

·         PhysicalDisk\Avg. Disk sec/Transfer
Threshold: Should not be more than 18 milliseconds.
Significance: This counter indicates the time, in seconds, of the average disk transfer. This may indicate a large amount of disk fragmentation, slow disks, or disk failures. Multiply the values of the Physical Disk\Avg. Disk sec/Transfer and Memory\Pages/sec counters. If the product of these counters exceeds 0.1, paging is taking more than 10 percent of disk access time, so you need more RAM.

·         PhysicalDisk\Disk Writes/sec
Threshold: Depends on manufacturer's specification.
Significance: This counter indicates the rate of write operations on the disk.

·         Physical Disk: %Disk Time
Threshold:  Greater than 50 percent, it represents an I/O bottleneck
Significance: Represents the percentage of elapsed time that the selected disk drive was busy servicing read or write requests.

·         Physical Disk\Avg. Disk Reads/Sec  and Physical Disk\Avg. Disk Writes/Sec
Threshold: It should be less than 85% of the disk capacity
Significance: It represents the rate of read operations on the disk.
When using above counters, you may need to adjust the values for RAID configurations using the following formulas.
Raid 0 -- I/Os per disk = (reads + writes) / number of disks
Raid 1 -- I/Os per disk = [reads + (2 * writes)] / 2
Raid 5 -- I/Os per disk = [reads + (4 * writes)] / number of disks
Raid 10 -- I/Os per disk = [reads + (2 * writes)] / number of disks
For example, you have a RAID-1 system with two physical disks with the following values of the counters.
Disk Reads/sec            80
Disk Writes/sec           70
Avg. Disk Queue Length    5
In that case, you are encountering (80 + (2 * 70))/2 = 110 I/Os per disk and your disk queue length = 5/2 = 2.5 which indicates a border line I/O bottleneck.
Disk I/O issues can be minimized by having larger and sequential I/O activities.    To achieve than you should consider partitioning.    Partitioning can be done in many ways – Database partitioning, Tempdb partitioning, Table partitioning, index partitioning.    At a minimum, it is highly recommended to partition your database in such a way that clustered index, non-clustered index, and transaction logs are each on a separate physical drive and spread across multiple logical filegroups/files within that drive.   

Sources:
http://blogs.technet.com/b/vipulshah/archive/2006/11/30/understanding-perfmon-counters-while-troubleshooting-sql-server-performance-issues.aspx

No comments:

Post a Comment