15 February 2010

Speed up your queries: convert dates into ids

This is a quick tip on query performance tuning. As always, don't forget to measure the performance against real data before applying any changes to a production environment.

Slow running queries, especially reports, often include a seemingly inoffensive range of dates in the where clause. Ultimately, the end-user wants to know "how many sales were completed last week" or "how many products were added in January". Take this simple example:
select *
from T1
where created_on between @startDate and @endDate
-- other criteria
This query needs an index on the created_on column to perform well. However, if the where clause also references other columns, the optimizer might decide to ignore that index and use a clustered index scan instead, mainly to avoid bookmark lookups. The end result is that the query has to read most, if not all, of the table to find candidate rows. This issue only gets worse if the query joins other tables using the primary key.

As long as the date range uses dates in the same order as the primary key and clustered index, we can substitute it for a range of ids. Generally, a column that stores the date and time of when the row was added is a good candidate for this technique. Let's try rewriting the query like this:

declare @startId int
select top 1 @startId = id
from T1
where created_on >= @startDate
order by created_on, id

declare @endId int
select top 1 @endId = id
from T1
where created_on <= @endDate
order by created_on desc, id desc

select *
from T1
where id between @startId and @endId
-- other criteria
It looks a lot more complicated but the basic idea is simple. First, we figure out the id of the first row in the date range. Then we do the same for the last row. These two queries will use the same index on created_on as the original query but they will run very fast since it's a simple index seek. Finally, the main query uses the same logic as before except with start and end ids instead of dates. Now the clustered index scan can skip most of the table and concentrate on the rows that are in the correct range.

That's it! The query should now run considerably faster with only a slight increase in complexity.

No comments: