Monday, January 30, 2012

Indexing - take the hint and leave it to the experts



The most common T-SQL command in use has to be the SELECT statement; it is the bedrock of any SQL professional's day. Sometimes it's used to snatch some data from a table or two during a quick investigation; other times it sits at the heart of a stored procedure or view that will inform business decisions for months or even years to come.
The latter purpose means you should spend some time making sure it is as efficient as possible - not endless hours to save a millisecond or two (I mentioned balancing effort against expected reward in my last blog), but a little while making sure it is pretty good.
During this time you might read through the MSDN details on the SELECT statement (http://msdn.microsoft.com/en-us/library/ms176104.aspx) and that might lead you to the Query Hint help (http://msdn.microsoft.com/en-us/library/ms181714.aspx). Now, as a diligent DBA looking at your execution plans, you may want to guide a query to use a particular index that you know exists on the table but that, for some apparently stupid reason, the query optimiser isn't using. This might be a good idea, but it might be a very bad idea. It isn't often that the average DBA knows better than the optimiser.
Let's look at a simple table and a simple query.
CREATE TABLE [dbo].[AccountFlags]
    (
      [AccountsFlagID] [int] IDENTITY(1, 1) NOT NULL ,
      [Account_Ref] [char](7) NOT NULL ,
      [FlagID] [int] NOT NULL ,
      [FlagValue] [bit] NULL
    )

ALTER TABLE [dbo].[AccountFlags]
ADD CONSTRAINT [PK_ACCOUNTSFLAGS]
PRIMARY KEY CLUSTERED ( [AccountsFlagID] ) ON [PRIMARY]

CREATE UNIQUE NONCLUSTERED INDEX [IX_ACCOUNTSFLAGS]
ON [dbo].[AccountFlags] ( [Account_Ref], [FlagID] ) ON [PRIMARY]

CREATE NONCLUSTERED INDEX [IX_ACCOUNTSFLAGS2]
ON [dbo].[AccountFlags] ( [Account_Ref], [FlagID] DESC ) ON [PRIMARY]

CREATE NONCLUSTERED INDEX [IX_ACCOUNTSFLAGS_IDValue]
ON [dbo].[AccountFlags] ( [FlagID], [FlagValue] )
INCLUDE ( [Account_Ref] ) ON [PRIMARY]
NOTE: This post isn't about what indexes to create, how to create them, or any such intricacies; it is purely to show how a query hint that forces the use of a certain index can disrupt T-SQL execution immediately and, on an ongoing basis, mislead your decisions as a DBA. If you want explanations of what indexes to create then you can find plenty of that advice in other blogs and Microsoft content.
On our table we have plenty of indexes covering the columns we are querying, so we should be seeing really fast query execution. However, if you have a query that contains a query hint, as follows
SELECT  FlagValue
FROM    dbo.[AccountFlags] AS af WITH ( INDEX ( PK_AccountsFLAGS ) )
WHERE   FlagID IN ( 1, 5 )
        AND Account_Ref = 'HAYNE04'
then the results of running the query in SSMS will be misleading, and the effect of running it within a production system will draw heavily on your server's resources.
For a start, the missing index suggestion will show in the results pane, and that missing index suggestion is always the same:
CREATE NONCLUSTERED INDEX []
ON [dbo].[AccountFlags]
([Account_Ref],[FlagID])
This is despite there being an index that matches exactly. By using the query hint you are forcing the query to run in a particular way; the table doesn't get evaluated for existing indexes, so the suggestion to create a helpful index is always made.
Now the query above has a cost of 10.0609; if we remove the query hint, the cost drops to 0.0075 - some 1300 times less effort. This is borne out by the execution times: 1196ms for the query with the hint and 26ms for the one without. The output of SET STATISTICS IO ON is another compelling argument too:
With query hint -
Table 'AccountFlags'. Scan count 5, logical reads 9461, physical reads 167, read-ahead reads 9397, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Without query hint -
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'AccountFlags'. Scan count 1, logical reads 3, physical reads 3, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
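You can reproduce the comparison yourself with SET STATISTICS IO and TIME; a minimal sketch, assuming the AccountFlags table above exists and holds representative data (the Account_Ref value is just the sample one used here):

-- Compare the hinted and unhinted versions of the query side by side
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- With the index hint
SELECT  FlagValue
FROM    dbo.[AccountFlags] AS af WITH ( INDEX ( PK_AccountsFLAGS ) )
WHERE   FlagID IN ( 1, 5 )
        AND Account_Ref = 'HAYNE04';

-- Without the hint - let the optimiser pick the index
SELECT  FlagValue
FROM    dbo.[AccountFlags] AS af
WHERE   FlagID IN ( 1, 5 )
        AND Account_Ref = 'HAYNE04';

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;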
This is clearly causing a lot of problems:
  1. The index usage statistics show a high number of Primary Key scans.
    Having the index-related DMVs in SQL Server has been a godsend to anyone working on indexes: knowing how indexes are being used (or not), and which indexes might be useful if created, based on actual usage statistics, is great. If the collection of those statistics is skewed by inappropriate query hints then all bets are off - you have no idea which indexes would actually be any use. (A query sketch against these DMVs follows this list.)
  2. The missing index DMVs recommend an index that already exists, and there is nothing preventing that index being duplicated.
    As explained above, having bad statistics is worse than having no statistics. You won't know your DMV statistics are being skewed by query hints unless you go looking for those hints, so you may well be getting it wrong already.
  3. The database is storing an awful lot more data than it needs to. If you are following the advice of the missing index information in the query plan then you could be creating index upon index on your tables. These will do nothing to improve your performance, as the query hint will prevent them being used, yet they will take up a lot of space - space that is updated every time an INSERT, UPDATE or DELETE takes place, and that also affects your backup and restore times and may even force you onto more hardware in order to cope.
  4. The query is still performing badly.
    In this case it is only (!) 1300 times worse than it might be. If you have more complex tables then you will suffer even more.
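If you want to see what the usage and missing index DMVs are reporting for yourself, a minimal sketch along these lines is a reasonable starting point (run it in the relevant database; the table name is the example one from above):

-- How the existing indexes on AccountFlags are actually being used
SELECT  OBJECT_NAME(ius.object_id) AS table_name ,
        i.name AS index_name ,
        ius.user_seeks ,
        ius.user_scans ,
        ius.user_lookups ,
        ius.user_updates
FROM    sys.dm_db_index_usage_stats AS ius
        JOIN sys.indexes AS i ON i.object_id = ius.object_id
                             AND i.index_id = ius.index_id
WHERE   ius.database_id = DB_ID()
        AND ius.object_id = OBJECT_ID('dbo.AccountFlags');

-- What the missing index DMVs are currently recommending
SELECT  mid.statement ,
        mid.equality_columns ,
        mid.inequality_columns ,
        mid.included_columns ,
        migs.user_seeks ,
        migs.avg_user_impact
FROM    sys.dm_db_missing_index_details AS mid
        JOIN sys.dm_db_missing_index_groups AS mig ON mig.index_handle = mid.index_handle
        JOIN sys.dm_db_missing_index_group_stats AS migs ON migs.group_handle = mig.index_group_handle
WHERE   mid.database_id = DB_ID();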
My advice? Leave Query Hints to extreme cases and apply them with advice from an index expert. Going it alone means you could be making hard work of your data access. Do you hate your servers that much?

Sources:
http://www.simple-talk.com/community/blogs/jonathanallen/archive/2012/01/10/105306.aspx

Monday, January 9, 2012

How to repair a Suspect Database in SQL Server


When you connect to an SQL Server instance you may find a database marked as SUSPECT. In that state you will not be able to connect to the database. In this article we will go through the steps which you need to follow to recover a database which is marked as SUSPECT. The steps mentioned in this article work on SQL Server 2005 and higher versions.

Some of the reasons why an SQL Server database can be marked as SUSPECT

  1. The database could have been corrupted.
  2. There is not enough space available for SQL Server to recover the database during startup.
  3. The database cannot be opened due to inaccessible files or insufficient memory or disk space.
  4. Database files are being held by the operating system, third-party backup software, etc.
  5. Unexpected SQL Server shutdown, power failure or hardware failure.
Due to a hardware failure, one of our databases, namely BPO, was marked SUSPECT when the SQL Server came back online. We had already had more than two hours of downtime because of the hardware failure, and adding to that, when the server came back online our most critical database was marked as SUSPECT.
Database in SUSPECT Mode

Steps to Recover a Database Marked as SUSPECT

1. Execute the below mentioned TSQL code to identify all the databases which are marked as SUSPECT.
USE master
GO

SELECT  NAME ,
        STATE_DESC
FROM    SYS.DATABASES
WHERE   STATE_DESC = 'SUSPECT'
GO
Identify databases in SUSPECT mode
2. Open the latest SQL Server Error Log and check for errors logged for the database which is marked as suspect. You can open the SQL Server Error Log by expanding the Management node -> SQL Server Error Logs. On my server I could find the below mentioned entries in the SQL Server Error Log.
Sample Error Messages within SQL Server Error Log when database is marked as SUSPECT
Starting up database 'BPO'.
Error: 9003, Severity: 20, State: 9.
The log scan number (189624:16:2) passed to log scan in database 'BPO' is not valid. This error may indicate data corruption or that the log file (.ldf) does not match the data file (.mdf). If this error occurred during replication, re-create the publication. Otherwise, restore from backup if the problem results in a failure during startup.
Error: 3414, Severity: 21, State: 1.
An error occurred during recovery, preventing the database 'BPO' (database ID 10) from restarting. Diagnose the recovery errors and fix them, or restore from a known good backup. If errors are not corrected or expected, contact Technical Support.
CHECKDB for database 'BPO' finished without errors on 2009-12-15 11:30:28.320 (local time). This is an informational message only; no user action is required.




3. When a database is in SUSPECT mode you will not be able to connect to it, so you first need to bring the database into EMERGENCY mode in order to repair it. Execute the below mentioned TSQL code to bring the database into EMERGENCY mode.
USE master
GO
 
ALTER DATABASE BPO SET EMERGENCY
GO
Once the database is in EMERGENCY mode you will be able to query the database.
SQL Server Database in EMERGENCY Mode 



4. Execute the DBCC CHECKDB command which will check the logical and physical integrity of all the objects within the specified database.
DBCC CHECKDB (BPO) 
GO
5. Next step will be to bring the user database in SINGLE_USER mode by executing the below mentioned TSQL code.
ALTER DATABASE BPO SET SINGLE_USER WITH ROLLBACK IMMEDIATE 
GO
6. Once the database is in SINGLE_USER mode, execute the below TSQL code to repair the database. When you repair your database using the REPAIR_ALLOW_DATA_LOSS option of the DBCC CHECKDB command there can be some loss of data, and once the database has been repaired this way there is no way to go back to the previous state.
DBCC CHECKDB (BPO, REPAIR_ALLOW_DATA_LOSS) 
GO




7. Finally, execute the below mentioned TSQL command to allow MULTI_USER access to the database.

ALTER DATABASE BPO SET MULTI_USER
GO
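As a quick sanity check once the repair completes, you can confirm the database state using the same sys.databases view as in step 1; a minimal sketch:

-- The database should now report ONLINE and MULTI_USER
SELECT  name ,
        state_desc ,
        user_access_desc
FROM    sys.databases
WHERE   name = 'BPO'
GO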
   

Conclusion

In this article you have seen how you can recover a database which is marked SUSPECT.



Sources: 
http://www.mytechmantra.com/LearnSQLServer/Repair_Suspect_Database_P1.html

Understanding Perfmon Counters while troubleshooting SQL Server Performance Issues


To troubleshoot an overall database system performance issue, analyzing performance counters is the best way to start. Collecting performance counters consistently during busy periods for a few days and analyzing that data will give a better idea of overall system problems regarding memory, CPU, and/or disk I/O. Please note that for troubleshooting a particular SQL problem, such as a stored procedure or a piece of T-SQL, it is better to look at the query execution plan and SQL Trace data and identify whether the query or the table indexes need redesigning.
Following are some key performance counters to use while assessing performance issues on SQL Server.

Memory:

Memory and disk I/O complement each other. Memory issues on the system can affect disk I/O and vice versa. It is critical to carefully observe the trend of performance counter data over a long period of time to identify the real problem.
• Memory\Available MBytes
Threshold: A consistent value of less than 20 to 25 percent of installed RAM is an indication of insufficient memory.
Significance: This indicates the amount of physical memory available to processes running on the computer. Note that this counter displays the last observed value only. It is not an average.

• Memory\Page Reads/sec
Threshold: Sustained values of more than five indicate a large number of page faults for read requests.
Significance: This counter indicates that the working set of your process is too large for the physical memory and that it is paging to disk. It shows the number of read operations, without regard to the number of pages retrieved in each operation. Higher values indicate a memory bottleneck.
If a low rate of page-read operations coincides with high values for Physical Disk\% Disk Time and Physical Disk\Avg. Disk Queue Length, there could be a disk bottleneck. If an increase in queue length is not accompanied by a decrease in the pages-read rate, a memory shortage exists.

• Memory\Pages Input/sec
Threshold: The value should not exceed 15. The higher the value, the poorer the performance.
Significance:  Pages Input/sec is the rate at which pages are read from disk to resolve hard page faults. Hard page faults occur when a process refers to a page in virtual memory that is not in its working set or elsewhere in physical memory, and must be retrieved from disk.

• Memory\Pages/sec
Threshold: Sustained values higher than five indicate a bottleneck.
Significance: This counter indicates the rate at which pages are read from or written to disk to resolve hard page faults. To determine the impact of excessive paging on disk activity, multiply the values of the Physical Disk\ Avg. Disk sec/Transfer and Memory\ Pages/sec counters.  If the product of these counters exceeds 0.1, paging is taking more than 10 percent of disk access time, which indicates that you need more RAM.  If this occurs over a long period, you probably need more memory.
A high value of Pages/sec indicates that your application does not have sufficient memory. The average of Pages Input/sec divided by average of Page Reads/sec gives the number of pages per disk read. This value should not generally exceed five pages per second. A value greater than five pages per second indicates that the system is spending too much time paging and requires more memory (assuming that the application has been optimized). 
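Alongside these Windows counters, SQL Server exposes its own memory counters through the sys.dm_os_performance_counters DMV. A minimal sketch (the LIKE filters are used because the object_name prefix differs on named instances):

-- A few SQL Server memory counters that complement the Windows Memory\ counters above
SELECT  object_name ,
        counter_name ,
        cntr_value
FROM    sys.dm_os_performance_counters
WHERE   ( object_name LIKE '%Buffer Manager%'
          AND counter_name = 'Page life expectancy' )
        OR ( object_name LIKE '%Memory Manager%'
             AND counter_name IN ( 'Total Server Memory (KB)', 'Target Server Memory (KB)' ) );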

Processor

To measure processor utilization and context switching, you can use the following counters:  
• Processor\% Processor Time
Threshold: The general figure for the threshold limit for processors is 65 percent.
Significance: This counter is the primary indicator of processor activity. High values may not necessarily be bad. However, if other processor-related counters, such as % Privileged Time or Processor Queue Length, are increasing linearly, high CPU utilization may be worth investigating.

• Processor\% Privileged Time
Threshold: A figure that is consistently over 75 percent indicates a bottleneck.
Significance: This counter indicates the percentage of time a thread runs in privileged mode. When your application calls operating system functions (for example to perform file or network I/O or to allocate memory), these operating system functions are executed in privileged mode.

• System\Context Switches/sec
Threshold: As a general rule, context switching rates of less than 5,000 per second per processor are not worth worrying about. If context switching rates exceed 15,000 per second per processor, then there is a constraint.
Significance: Context switching happens when a higher priority thread preempts a lower priority thread that is currently running or when a high priority thread blocks. High levels of context switching can occur when many threads share the same priority level. This often indicates that there are too many threads competing for the processors on the system. If you do not see much processor utilization and you see very low levels of context switching, it could indicate that threads are blocked.  

Disk I/O

To measure disk I/O activity, you can use the following counters:  
• PhysicalDisk\Avg. Disk Queue Length
Threshold: Should not be higher than the number of spindles plus two.
Significance: This counter indicates the average number of both read and write requests that were queued for the selected disk during the sample interval.

• PhysicalDisk\Avg. Disk Read Queue Length
Threshold: Should be less than two.
Significance: This counter indicates the average number of read requests that were queued for the selected disk during the sample interval.

• PhysicalDisk\Avg. Disk Write Queue Length
Threshold: Should be less than two.
Significance: This counter indicates the average number of write requests that were queued for the selected disk during the sample interval.

• PhysicalDisk\Avg. Disk sec/Read
Threshold: No specific value.
      Less than 10 ms – very good
      Between 10-20 ms – okay
      Between 20-50 ms – slow, needs attention
      Greater than 50 ms – Serious I/O bottleneck

Significance: This counter indicates the average time, in seconds, of a read of data from the disk.

• PhysicalDisk\Avg. Disk sec/Transfer
Threshold: Should not be more than 18 milliseconds.
Significance: This counter indicates the time, in seconds, of the average disk transfer. This may indicate a large amount of disk fragmentation, slow disks, or disk failures. Multiply the values of the Physical Disk\Avg. Disk sec/Transfer and Memory\Pages/sec counters. If the product of these counters exceeds 0.1, paging is taking more than 10 percent of disk access time, so you need more RAM.

• PhysicalDisk\Disk Writes/sec
Threshold: Depends on manufacturer's specification.
Significance: This counter indicates the rate of write operations on the disk.

• PhysicalDisk\% Disk Time
Threshold: A value greater than 50 percent represents an I/O bottleneck.
Significance: Represents the percentage of elapsed time that the selected disk drive was busy servicing read or write requests.

• PhysicalDisk\Disk Reads/sec and PhysicalDisk\Disk Writes/sec
Threshold: Should be less than 85% of the disk capacity.
Significance: These counters represent the rate of read and write operations on the disk.
When using the above counters, you may need to adjust the values for RAID configurations using the following formulas.
Raid 0 -- I/Os per disk = (reads + writes) / number of disks
Raid 1 -- I/Os per disk = [reads + (2 * writes)] / 2
Raid 5 -- I/Os per disk = [reads + (4 * writes)] / number of disks
Raid 10 -- I/Os per disk = [reads + (2 * writes)] / number of disks
For example, you have a RAID-1 system with two physical disks with the following values of the counters.
Disk Reads/sec            80
Disk Writes/sec           70
Avg. Disk Queue Length    5
In that case, you are encountering (80 + (2 * 70)) / 2 = 110 I/Os per disk, and your disk queue length = 5 / 2 = 2.5, which indicates a borderline I/O bottleneck.
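If you want to plug your own counter values into the same arithmetic, here is a small sketch of the RAID-1 calculation above (the variable values are the sample numbers, not live counter reads):

-- RAID-1: I/Os per disk = (reads + 2 * writes) / number of disks
DECLARE @reads float = 80 ,
        @writes float = 70 ,
        @disks float = 2 ,
        @queue_length float = 5;

SELECT  ( @reads + 2 * @writes ) / @disks AS io_per_disk ,      -- 110
        @queue_length / @disks            AS queue_per_disk;    -- 2.5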
Disk I/O issues can be minimized by having larger and sequential I/O activities. To achieve that you should consider partitioning. Partitioning can be done in many ways - database partitioning, tempdb partitioning, table partitioning, index partitioning. At a minimum, it is highly recommended to partition your database in such a way that clustered indexes, non-clustered indexes, and transaction logs are each on a separate physical drive and spread across multiple logical filegroups/files within that drive.

Sources:
http://blogs.technet.com/b/vipulshah/archive/2006/11/30/understanding-perfmon-counters-while-troubleshooting-sql-server-performance-issues.aspx

Wednesday, January 4, 2012

Managing Transaction Logs


The transaction log and how SQL uses it seems to be one of the most misunderstood topics among newcomers to the DBA role. I’m going to see if I can shed a little light on what the transaction log is, why SQL uses it, how to manage it and what not to do to it.

What is the Transaction Log?

At its simplest, the transaction log is a record of all transactions run against a database and all database modifications made by those transactions. The transaction log is a critical part of the database’s architecture.
The transaction log is not an audit log. It’s not there so that the DBA can see who did what to the database. It’s also not a data recovery tool.  There are third-party tools that can get audit or data-recovery info from the log, but that is not its primary purpose.
The transaction log is predominantly used by the SQL engine to ensure database integrity, to allow transaction rollbacks and for database recovery. 

How does SQL use the log?

When changes are made to a database, whether it be in an explicit transaction or an auto-committed transaction, those changes are first written (hardened) to the log file and the data pages are changed in memory. Once the record of the changes is in the log, the transaction is considered complete. The data pages will be written to the data file on disk at a later time either by the lazy writer or by the checkpoint process.
If transactions are rolled back, either by an explicit ROLLBACK TRANSACTION, by an error if XACT_ABORT is on, or due to a loss of connection to the client, the transaction log is used to undo the modifications made by that transaction and leave the database as though the transaction had never occurred. In a similar way, the log is used to undo the effects of single statements that fail, whether in an explicit transaction or not.
When a server is restarted, SQL uses the transaction log to see if, at the point the server shut down, there were any transactions that had completed but whose changes may not have been written to disk, or any transactions that had not completed. If there are, then the modifications that may not have been written to disk are replayed (rolled forward) and any transactions that had not completed are rolled back. This is done to ensure that the database is in a consistent state after a restart and that any transactions that had committed remain a part of the permanent database state (the Durability requirement of ACID).
Lastly, backups made of the transaction log can be used to recover a database to a point-in-time in case of a failure.
The transaction log is also used to support replication, database mirroring and change data capture. I won’t be going into how they affect the log here. 
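For reference, the log backups mentioned above are plain BACKUP LOG statements taken after a full backup has started the chain; a minimal sketch (the database name and paths are placeholders):

-- A full backup starts the log chain...
BACKUP DATABASE MyDatabase
TO DISK = N'D:\Backups\MyDatabase_full.bak';

-- ...after which regular log backups allow point-in-time recovery
BACKUP LOG MyDatabase
TO DISK = N'D:\Backups\MyDatabase_log.trn';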

Recovery models and the Transaction Log

The database recovery model does not (with the exception of bulk operations) affect what is written to the transaction log. Rather, it affects how long log entries remain in the log. This is just a very basic look at the recovery models. For more detail, see my article "Recovery Models".

Simple Recovery Model

In the simple recovery model, the transaction log entries are kept only to allow for transaction rollbacks and crash recovery, not for the purpose of restoring a database. Once the data pages have been written to disk and the transaction is complete; then, in the absence of replication or other things that need the log, the log records are considered inactive and can be marked as reusable. This marking of portions of the log as reusable is done by the checkpoint process.
This is the simplest recovery mode in terms of log management as the log manages itself. The downside of simple recovery is that (because transaction log backups cannot be made) a restore of the database can only be done to the time of the latest full or differential database backup. With a busy database, this can result in unacceptable data loss.

Full Recovery model

In the full recovery model, transaction log entries cannot be overwritten until they have been backed up by a transaction log backup. Simply having the transaction committed and the data pages written to disk is not enough.
Full recovery can be more difficult to manage as the log can grow beyond what is expected if transaction log backups don’t occur, or if there’s an increase in the amount of database activity that occurs between log backups.
Without any log backups running (an all-too-common occurrence, seeing that new databases will default to the full recovery model unless the recovery model of the model database has been changed), the transaction log will grow until it reaches its configured maximum file size (2TB unless otherwise specified) or until it fills the disk. No amount of full or differential backups will allow the log space to be reused, as neither marks log space as reusable.
What can be even worse is that a database in the full recovery model does not behave like this from the moment it is created. When created, a database in full recovery model will behave in a manner sometimes called pseudo-simple recovery model. This occurs because no database backup has yet been taken, and a database backup is needed to start a log chain. While in pseudo-simple recovery, the database behaves as though it really were in simple recovery model, truncating the log (marking space as reusable) every time a checkpoint occurs. This state remains until the first full backup of the database is taken. That full backup starts the log chain; from that point on the log will no longer be marked reusable by the checkpoint process and, if there are no log backups, the log will begin to grow.
Because log records are not overwritten until they have been backed up, a database in full recovery mode can be recovered to any time using a combination of full, differential and log backups, assuming a starting full backup exists and none of the log backups since have been deleted.
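To check which recovery model a database is actually in, and what (if anything) is currently preventing log space from being reused, sys.databases can be queried; a minimal sketch (the database name is a placeholder):

SELECT  name ,
        recovery_model_desc ,
        log_reuse_wait_desc   -- e.g. NOTHING, LOG_BACKUP, ACTIVE_TRANSACTION
FROM    sys.databases
WHERE   name = N'MyDatabase';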

Bulk-logged recovery model

Bulk-logged is very similar to full recovery, except that in bulk-logged, bulk operations are minimally logged.  When operations are minimally logged, much less information is written to the transaction log compared to when the operation is fully logged.
The advantage of bulk-logged recovery is that if there are bulk operations occurring, the impact those operations have on the transaction log is less than it would be if the database was in full recovery mode. However the transaction log backups may be much larger than the transaction log itself since the log backups include all data pages modified by bulk operations since the previous log backup.
I'm not going to discuss bulk-logged recovery model further than this in the current article. For the purposes of log management, bulk-logged recovery model can be treated much like full recovery.
Managing transaction logs

Managing your transaction log requires you to think about your recovery model, log backups, and various other details regarding the log files.

Picking a recovery model

The key to effectively managing transaction logs is to know what the availability and recovery requirements are for the database. The recovery model should not be chosen because of performance issues or space concerns.
If there is no requirement for point-in-time recovery and it is acceptable, in the case of a disaster, to restore the database to the last full/differential backup, then simple recovery model can be used. In reality, it’s not that common to have a database where the loss of several hours of data is acceptable, so in general simple recovery model should be limited to development or testing environments or databases that can be completely recreated from a source if they fail.
If there is a requirement for point-in-time recovery and minimal or no data loss in the case of a disaster, then the database should be in full or bulk-logged recovery model. This, along with proper storage and retention for the backup files can allow the database to be restored to the point of failure or near to the point of failure in most situations.
If the database is in full or bulk-logged recovery model then log backups must be done. Without log backups the log entries will never be discarded from the log and the log file will grow until it fills the drive. Since one of the main reasons for having a database in full or bulk-logged recovery model is to allow the database to be restored without data loss in the case of a disaster, it’s important to have an unbroken log chain to allow a restore to the point of failure if such a restore becomes necessary. This means that no log backups can be lost or discarded and the database cannot be switched to simple recovery model.

Frequency of log backups

The frequency that log backups should be done is dependent on two considerations:
  1. The maximum amount of data that can be lost in the case of a disaster
  2. The size to which the log can grow.
The first consideration is by far the most important.
When in full recovery model a disaster that destroys or damages the database can be recovered from without data loss, providing the transaction log file is available. If the log file is not available then the best that can be done, assuming that the log backups are stored separately from the database files and are available, is a restore to the last log backup taken, losing all data after that point. From this it should be clear that the interval between log backups should be carefully chosen based on the RPO (recovery point objective) requirements for that database. If the mandate is that no more than 20 minutes of data can be lost, then scheduling log backups hourly is a risk, as a disaster can result in losing up to 60 minutes of data if the transaction log file is lost or damaged.
If (once the log backups have been set up and scheduled based on the database's RPO) the log is growing larger than is acceptable, then the log backup frequency can be increased in order to keep the size down. This is by no means guaranteed to have the desired effect, as the log must still be large enough to accommodate the largest single transaction run against the database (which is usually an index rebuild).
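To see how large the log is and how full it gets between backups, DBCC SQLPERF with the LOGSPACE option is a quick check; a minimal sketch:

-- Returns log size (MB) and log space used (%) for every database on the instance
DBCC SQLPERF ( LOGSPACE );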

Log chains

A log chain is an unbroken set of log backups starting with a full or differential backup and reaching to the point where the database needs to be restored. When a log chain is broken, the database cannot be restored to any point past the point at which the log chain was broken until a full or differential backup is taken to restart the log chain.
Switching a database to simple recovery model immediately breaks the log chain and prevents further log backups. Log backups cannot be taken until the database is switched back to full or bulk-logged recovery and a full or differential backup is taken. The database can't be restored to any point between the last log backup prior to the switch to simple recovery and the full or differential backup that re-established the log chain.
Deleting or losing a log backup file breaks the log chain, although SQL doesn't know that the log chain is broken in that case and allows log backups to still be taken. Regardless, if a log backup file is missing the log chain is broken and point-in-time restores can't be done to any point within or after the time range covered by the missing log backup. As with the case of switching to simple recovery model, a full or differential backup must be taken to start a new log chain.
It is important to note that while a full or differential backup starts a log chain, full and differential backups don't break the log chain. If, for example, full backups were taken every 4 hours and transaction log backups hourly, and a restore was needed to 5pm, then all of these would be valid, working options (option 1 is sketched as code after this list):
  1. Restore 4pm full backup, 5pm log backup
  2. Restore midday full backup, log backups from 1pm-5pm
  3. Restore 8am full backup, log backups from 9am-5pm
  4. Etc.
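Option 1 would look something like the following; the file names, paths and STOPAT time are placeholders, the point is the WITH NORECOVERY / STOPAT pattern:

-- Restore the 4pm full backup without recovering the database...
RESTORE DATABASE MyDatabase
FROM DISK = N'D:\Backups\MyDatabase_full_1600.bak'
WITH NORECOVERY, REPLACE;

-- ...then the 5pm log backup, stopping at the required point in time
RESTORE LOG MyDatabase
FROM DISK = N'D:\Backups\MyDatabase_log_1700.trn'
WITH STOPAT = N'2012-01-04 17:00:00', RECOVERY;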

Log size

The question of how big a log file should be given a data file of a particular size is a common one, and one almost impossible to easily answer.
There is no single formula that calculates how large a log file should be based on the data file size. The size that the log file needs to be is based on activity (specifically database changes) and, in full and bulk-logged recovery, on the interval between log backups. This is not easy to calculate without performing a load-test of the database with both production-level data volumes and production-level transaction volumes.
At a minimum, the log must be large enough to hold the log records for the largest transaction that will be run against the database. In full recovery that will likely be the index rebuild of the largest clustered index in the database. So, with no other considerations, a reasonable place to start would be around 150% of the size of the largest table in the database with a reasonable growth increment based on the initial size, assuming that the largest table will have its clustered index rebuilt in full recovery model. This may need to be increased based on a number of considerations, including but not limited to
  • Largest size of the database mirroring send queue
  • Transaction volume during a full or differential backup
  • Latency and transaction volume of transactional replication or CDC.
A large log file will not cause performance problems for queries in the database. It is possible that a very large number of Virtual Log Files will result in slower-than-expected log backups and other operations that read the log, but that's not a problem of a large log; that's a result of growing the log file in small increments.

Number of log files

The answer to the question of how many log files a database should have is a simple one. One log file only. SQL uses log files sequentially, not in any form of parallel or Round-Robin mechanism. Hence, if there are multiple log files, SQL will only ever be writing to one at a time.
The one time where there may be a use for a second log file is when unusual database activity needs a total amount of log space larger than what is available on any particular drive or larger than the 2TB limit in the size of a log file. In these circumstances it may be necessary to create a second log file to increase the amount of available log space. This is purely about available log space, not about performance.

Shrinking the log

In general, the transaction log should not be shrunk. It certainly should never be shrunk on a regular basis in a job or maintenance plan.
The only time a log should be shrunk is if some abnormal database activity (or failed log backups) has resulted in the log growing far beyond the size it needs to be for the database activity. In this situation, the log can be shrunk as a once-off operation, reducing it back to the size that it was before the abnormal activity.
Shrinking the log on a regular basis will have just one effect - the log growing again once regular activity on the database requires the old size. Transaction log grow operations are not fast; they cannot take advantage of instant file initialisation and hence the new portion of the log always has to be zeroed out. The other effect of the log repeatedly growing is that, unless the auto-grow setting has been carefully chosen, the growth of the log will result in log fragmentation - excessive VLFs that can degrade the performance of backups, restores, crash recovery, replication and anything else that reads the log.
Don't shrink the log regularly. Shrink only if something has blown the size of the log far beyond what it needs to be.

Log fragmentation and VLFs

Internally, the log is divided into sections called Virtual Log Files (VLF). A log will always contain at least 2 VLFs and will usually contain far more. When the log is truncated (checkpoint in simple recovery or log backup in full recovery), only entire VLFs can be marked reusable. SQL can't mark individual log records or log blocks as reusable. A single log record that's part of an open transaction or otherwise needed prevents the entire VLF from being marked as reusable.
When the log is created or grows, the specified size results in a specific number of VLFs of specific sizes. More details on the exact algorithm can be found on Kimberly Tripp's blog. If the log was improperly sized initially and auto-grew to a huge size, the log can have vast numbers of VLFs (tens of thousands have been seen in production systems). The problem with excessive VLFs is that they can have a massive impact on operations that read the log. These include, but are not limited to, database and log backups, restores, crash recovery, transactional replication and change data capture.
There's no exact number of VLFs that are good or bad. If there are thousands, it's probably bad. Tens of thousands is definitely very bad. Hundreds?  That depends on the size of the log. The number of VLFs should be high enough that SQL doesn't have to keep huge portions of the log active but low enough that reading all the VLF headers doesn't take too long.
The fix for excessive VLFs is to shrink the log to 0 and then regrow it in reasonable chunks to its previous size. This obviously has to be done while the database is idle. The size of the grow increments determines how many VLFs will be in the new file, see Kimberly’s blog post on log throughput for details, as well as the previously mentioned blog post: http://www.sqlskills.com/blogs/kimberly/post/8-Steps-to-better-Transaction-Log-throughput.aspx.
The growth increment must not be an exact multiple of 4GB. There's a bug in SQL if that exact size is used. Any other size works as expected.
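A sketch of that fix is below; the logical log file name and target size are placeholders, and DBCC LOGINFO (which returns one row per VLF) is a handy way to check the VLF count before and after:

-- One row per VLF in the current database's log
DBCC LOGINFO;

-- Shrink the log right down...
DBCC SHRINKFILE ( MyDatabase_log , 0 );

-- ...then regrow it in sensible chunks up to the required size
-- (repeat with increasing SIZE values; avoid exact multiples of 4GB)
ALTER DATABASE MyDatabase
MODIFY FILE ( NAME = MyDatabase_log , SIZE = 8000MB );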

Log mismanagement

There's a huge amount of advice on various forums and blogs as to management of logs. Unfortunately much of it is incorrect or outright dangerous. To end this article I'll touch on a couple of particularly bad forms and explain why they're so bad.
Detach the database, delete the log file, and reattach the database.
This one's particularly terrible as it can cause unexpected downtime and potentially could even result in the complete loss of the database. SQL cannot always recreate the log if a database is attached without one. A log can only be recreated if the database was shut down cleanly. That means no uncommitted transactions, sufficient time to perform a checkpoint and sufficient log space to perform a checkpoint.
If the database is detached and the log deleted and SQL could not cleanly shut the database down beforehand, the database will fail to reattach necessitating a restore from backup or hacking the database back in and doing an emergency mode repair. Said repair can fail and if it does with no available backup, the database is essentially lost without hope of recovery.
Set the recovery model to Simple, shrink the log to 0 and then set the recovery model back to full
This one's nasty for two reasons.
The switch to simple recovery breaks the log chain, which means that there is no possibility for point-in-time recovery past this point until another full or differential backup is taken to restart the log chain (something most people recommending this don't mention). Even if a full backup is taken immediately after switching back to full recovery, breaking the log chain reduces the options available for restoring. If a full backup taken after the point the log chain was broken is damaged, a full backup from before can't be used without potentially significant data loss.
The other reason is that shrinking the log to 0 will immediately force it to grow. When the log grows it has to be zero-initialised and hence can slow down all data modifications occurring. It can also result in a huge number of VLFs if the auto grow increments are not properly chosen.

Conclusion

The transaction log is a crucial piece of a database and a good understanding of basic log management is an essential skill for any DBA. In this article I've briefly covered what the log is used for and how the recovery models affect that.  I've also touched on how the log should and shouldn’t be managed and some other considerations that exist for dealing with the transaction log of a SQL Server database.

Sources: