08 March 2010

Why did I use jQuery to open links in a new window?

To make my XHTML pages validate, of course. Using a plain old target="_blank" on links isn't allowed any more, is it? Every web developer seems to have an explanation or a workaround for the problem but, honestly, I've given up trying to understand what the fuss is about.

I agree that page validation is a useful exercise but this is such a small issue, I was tempted to just ignore it. Honestly, if target="_blank" is your only validation error, who's going to care? Well, the reason I did care was that the links in question were in the master page, which meant they showed up on just about every page on the site. Always getting that silly red light on my status bar makes it a lot harder to tell if there's anything else wrong with each page: something that really matters, something that might break the page come the next browser release. In the long run, it's easier to fix the links than put up with the errors.

So what did I do about it? Out of the many options I could choose from, I started with jQuery because it makes everything pretty simple and we were already using the library anyway. I used a rel="external" attribute to mark the links that should open new windows, as suggested by a bunch of sites. It feels semantically correct even though external isn't one of the link types defined by the W3C. Next, a jQuery selector finds those links and dynamically adds target="_blank" to them. As you'd expect, the browser does the right thing, including opening the link in the same window if scripting is disabled. (If you don't allow scripting, I suppose that's the behavior you expect to see.)
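For reference, here's a minimal sketch of that wiring, assuming the links are marked up exactly as described above (the selector is the only moving part):

$(document).ready(function () {
    // Find every link marked rel="external" and add target="_blank" at runtime,
    // keeping the static XHTML free of the invalid attribute.
    $('a[rel="external"]').attr('target', '_blank');
});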

If you get the impression we're pretending to fool the validator while it looks the other way and pretends it didn't see anything, then you've got the right idea. Still, now I've got green lights all the way and that's really what I was looking for.

01 March 2010

Running selects in a scalar-valued function = FAIL

Guess what, scalar-valued functions are used a lot in where clauses. That means they have to be evaluated for every candidate row. No, I don't mean just each row in the result set, I mean every row that SQL Server thinks might end up in the result set. In order to figure out if a row belongs in the result set, the engine must execute the function with that row's values. That's usually a lot of executions because, in principle, the select can return the whole table. Some rows might be discarded early due to other where conditions but you get the idea.

To end up with bad performance, all you need is a table with a few million rows. Now imagine running your function for a significant portion of those rows. How many executions is that? If the function performs a simple calculation on the parameter values, that's not too bad. After all, computers can perform billions (or is it trillions?) of operations per second these days, can't they? The real problem occurs when the function has to run another select. If this second select involves a large number of rows by itself, the main select suddenly has a few orders of magnitude more work to do. But even if the second select only has to look at a single row, you're still left with a significant number of random I/O reads that can slow your main select to a crawl.
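To make that concrete, here's a hypothetical function of exactly that shape; the Customers and Orders tables and their columns are made up for illustration:

create function dbo.CustomerTotal (@CustomerId int)
returns money
as
begin
    -- Runs its own select every single time it is called
    declare @total money
    select @total = sum(Amount)
    from Orders
    where CustomerId = @CustomerId
    return @total
end
go

-- Used like this, it executes once per candidate row of Customers:
-- select * from Customers c where dbo.CustomerTotal(c.CustomerId) > 1000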

So what's the answer? Since a scalar-valued function can contain an endless combination of T-SQL statements, SQL Server has only a limited ability to adjust the execution plan of the function's select to optimize the overall performance of the main select. If you can join the function's select into the main query, the engine stands a much better chance of finding an efficient plan. But what about the encapsulation and reusability of the select in the function? Do you have to throw that away in the name of performance? Not really; there are other ways you can achieve the same goals. You might want to try a view or a stored procedure instead. Even better, have you heard of table-valued functions?
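Sticking with the hypothetical CustomerTotal example from above, here's one way the same logic could be folded back into the main query as an inline table-valued function:

create function dbo.CustomerTotals ()
returns table
as
return (
    select CustomerId, sum(Amount) as Total
    from Orders
    group by CustomerId
)
go

-- The optimizer can now expand the function and pick a set-based plan:
select c.*
from Customers c
join dbo.CustomerTotals() t on t.CustomerId = c.CustomerId
where t.Total > 1000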

If you use one of these alternatives, the overall structure of your query will look very different and that's not a bad thing. The idea of using a non-trivial scalar-valued function in a select sounds very imperative anyway - the kind of thing that appeals to an applications programming mindset. In a relational database, you're usually better off using a set-based approach.

22 February 2010

Class inheritance in relational databases

Here's my rather crude definition of class inheritance: the ability to create a base class encapsulating the common elements of a bunch of other classes while letting developers focus on a higher-level abstraction. If you write code in any OO language you probably use inheritance every so often. That's fine, it works great and we all understand it pretty well, but what happens if your objects have to be stored in a database? Well, maybe not your objects per se but the data they store and represent. You probably use a relational database already but it doesn't have any support for this so-called inheritance. Now what?

Actually, there is a well known technique for implementing the same idea in relational databases. Joe Celko explains all the little details in his great book so, if you can, go check it out. Anyhow, this is how I would do it.

Let's use an example. A database for a bicycle inventory system stores each type of bicycle part in its own table: Frame, Wheel, Pedal, etc. That makes sense since each part requires entirely different columns. The system has to track if a frame is for men or women, if a pedal requires special shoes, and if a wheel has spokes or not. (By the way, an alternative would be to store everything in a very wide table with lots of nullable columns but that has its own problems and doesn't use inheritance.) However, the details of a sale (quantity, date, etc.) are the same for every part. Do we really need a Sale table that has nullable FrameId, WheelId and PedalId columns? There must be a better way.

First of all, we need a Part table - that will be our base. Next, every row in Frame, Wheel or Pedal has to have a corresponding row in Part. The best way to do that is to share ids. If we're going to do that, someone has to be in charge of issuing new ids and the logical candidate is the base table. How about this:

create table Part (
    id int identity not null primary key
)

create table Frame (
    id int not null primary key references Part
    -- other Frame columns
)

create table Wheel (
    id int not null primary key references Part
    -- other Wheel columns
)

create table Pedal (
    id int not null primary key references Part
    -- other Pedal columns
)
Great, now Sale can reference Part without caring whether the part is a wheel or a frame. We have a base, we have some derived tables and every derived row is also a member of the base table. That looks like inheritance to me. Job done! Let's move on.

Hold on a second, on closer inspection things look a bit messy. We couldn't join Sale to Part and say whether a wheel was sold even if we wanted to. We would have to join to Wheel as well and see if a match was found or not. In other words, there's no way to tell what type a part is just by looking at the Part table. What we need is a part type:

create table PartType (
    typeId int not null primary key,
    typeName varchar(20) not null
)

insert PartType values (1, 'Frame')
insert PartType values (2, 'Wheel')
insert PartType values (3, 'Pedal')

alter table Part
    add typeId int not null references PartType
That's more like it. With this modification, the base table can assert that each row is actually of a specific derived type. If we add a row to Part like this:
insert Part(typeId) values (3)
    -- the row is assigned Part.id = 1
We know that the new row is a Pedal. Wait a minute, does that mean I can't add a row to Wheel with id 1? Actually, you still can, and that would be weird. The Part table would say that part id 1 is a pedal but Wheel would say it's a wheel. Maybe it's a new hybrid, the PedalWheel, or is it the WheelPedal?

We are still missing the final piece of the puzzle that brings it all together. Adding a row to the wrong derived table should not be allowed. We can stop that from happening by changing the foreign keys and adding a new constraint. Let's drop the tables and start again:

drop table Frame
drop table Wheel
drop table Pedal
drop table Part
drop table PartType

create table PartType (
    typeId int not null primary key,
    typeName varchar(20) not null
)

insert PartType values (1, 'Frame')
insert PartType values (2, 'Wheel')
insert PartType values (3, 'Pedal')

create table Part (
    id int identity not null primary key,
    typeId int not null references PartType,
    constraint u_Part unique (id, typeId)
)

create table Frame (
    id int not null primary key,
    typeId int not null,
    -- other Frame columns
    constraint fk_Frame_Part foreign key (id, typeId)
    references Part (id, typeId),
    constraint ck_Frame_Type check (typeId = 1)
)

create table Wheel (
    id int not null primary key,
    typeId int not null,
    -- other Wheel columns
    constraint fk_Wheel_Part foreign key (id, typeId)
    references Part (id, typeId),
    constraint ck_Wheel_Type check (typeId = 2)
)

create table Pedal (
    id int not null primary key,
    typeId int not null,
    -- other Pedal columns
    constraint fk_Pedal_Part foreign key (id, typeId)
    references Part (id, typeId),
    constraint ck_Pedal_Type check (typeId = 3)
)
We've added typeId to the derived tables. At first, that seems unnecessary because each table will always have the same value in that column. Every row in Wheel will have typeId = 2. However, we need the column so it can take part in the foreign key and so we can apply the check constraint. Also note that, even though id alone is the primary key of Part, we need to specify a unique constraint on the pair (id, typeId) so it can be used as an alternate key. (Trivia: the SQL spec defines table constraints that can check values in other tables, which would eliminate the need for typeId in the derived tables. Unfortunately, SQL Server doesn't support that feature.)
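As a quick illustration (assuming the tables are empty and the new Part row gets id 1), the constraints now reject the hybrid scenario:

insert Part (typeId) values (3)          -- a new Pedal, assigned id = 1
insert Pedal (id, typeId) values (1, 3)  -- fine
insert Wheel (id, typeId) values (1, 3)  -- fails ck_Wheel_Type: typeId must be 2
insert Wheel (id, typeId) values (1, 2)  -- fails fk_Wheel_Part: no Part row with (1, 2)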

There you have it: the model is flexible and can deal with pretty much any single-inheritance scenario. It's also robust and guards against most types of invalid data. The only outstanding problem is that it's possible to add a row to the base table without a corresponding row in the correct derived table. Unfortunately, that's not something the model itself can address (at least not that I'm aware of) but there are a few ways to deal with the issue. Personally, I would use stored procedures for the inserts, updates, and deletes. If T-SQL isn't your thing, try building the correct logic into the data access layer of your application instead.
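For example, a hypothetical insert procedure for wheels might look something like this (parameters and error handling omitted to keep the sketch short):

create procedure dbo.InsertWheel
    -- parameters for the other Wheel columns would go here
as
begin
    set nocount on
    begin transaction
        declare @id int
        insert Part (typeId) values (2)            -- 2 = Wheel
        set @id = scope_identity()
        insert Wheel (id, typeId) values (@id, 2)  -- base and derived rows created together
    commit transaction
end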

Now go have a good look at your database model. Can you find any tables that could use this technique?

15 February 2010

Speed up your queries: convert dates into ids

This is a quick tip on query performance tuning. As always, don't forget to measure the performance against real data before applying any changes to a production environment.

Slow running queries, especially reports, often include a seemingly inoffensive range of dates in the where clause. Ultimately, the end-user wants to know "how many sales were completed last week" or "how many products were added in January". Take this simple example:
select *
from T1
where created_on between @startDate and @endDate
-- other criteria
This query needs an index on the created_on column to perform well. However, if the where clause also references other columns, the optimizer might decide to ignore that index and use a clustered index scan instead, mainly to avoid bookmark lookups. The end result is that the query has to read most, if not all, of the table to find candidate rows. This issue only gets worse if the query joins other tables using the primary key.
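For what it's worth, the supporting index assumed here is nothing fancy (the index name is made up):

create index IX_T1_created_on on T1 (created_on)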

As long as the dates increase in the same order as the primary key and clustered index, we can replace the date range with a range of ids. Generally, a column that stores the date and time the row was added is a good candidate for this technique. Let's try rewriting the query like this:

declare @startId int
select top 1 @startId = id
from T1
where created_on >= @startDate
order by created_on, id

declare @endId int
select top 1 @endId = id
from T1
where created_on <= @endDate
order by created_on desc, id desc

select *
from T1
where id between @startId and @endId
-- other criteria
It looks a lot more complicated but the basic idea is simple. First, we figure out the id of the first row in the date range. Then we do the same for the last row. These two queries use the same index on created_on as the original query but they run very fast since each one is a simple index seek. Finally, the main query uses the same logic as before except with start and end ids instead of dates. Now the clustered index scan can skip most of the table and concentrate on the rows in the correct range.

That's it! The query should now run considerably faster with only a slight increase in complexity.

08 February 2010

The table with a tiny key

Whenever I model a new database table I like to start with an auto-incrementing primary key column. Maybe it's the Ruby on Rails convention creeping up on me, but I've even adopted the minimalist id as my default name for such columns. A lot of times I go ahead and make it an int. Sounds pretty good, huh? After all, the good old int is the de facto standard when it comes to integer data types. What else would you use, right?

Since you ask, there are a few alternatives out there: smallints, tinyints and bigints. Let's start with bigints. I don't think the name does it justice. Bigints aren't just big - they are massive. Massive in the range they can represent but also pretty hefty in the storage space they require: 8 bytes instead of the usual 4 an int takes. How can an additional 4 bytes be a problem? We'll come to that in a second but for now, just ask yourself if your table will eventually have more than 2 billion rows. In most cases, that's a very reasonable upper limit. For example, if you started with an empty table and added a new row every second, it would still take you close to 70 years to reach that limit.

OK, so you probably don't need a bigint unless you're Google, Flickr or maybe Walmart. What about the other data types? Smallints and tinyints can't represent the large range of numbers that ints can. To refresh your memory, smallints go up to a little over 32,000 and of course tinyints to a mere 255. And that's precisely where things get interesting. You see, the truth is, any database has a bunch of tables that have very few rows each. They might be domain tables like MaritalStatus ("Single", "Married", "Divorced", etc.) or PhoneType ("Work", "Home" or "Cell/Mobile"). Other tables might be limited by the nature of the data they store. For example, a table called WorldAirports is probably going to have fewer than 50,000 rows. My point is, why use an int as a unique identifier for each row if you're not going to have that many rows? Why not use a smallint and save 2 bytes per row? Even better, use a tinyint and chop 3 bytes off every row. If that doesn't seem like a lot, try looking at it this way: a tinyint requires 75% less space than an int.

Now, wait a minute! What's the point of optimizing a table if it only has a few rows anyway? The total savings are going to be negligible! Well, not quite... The magic of these small tables is that their primary keys get used as foreign keys with the same data type in much larger tables. Let's say you have a PhoneDirectory table with one million phone numbers and each phone number has a PhoneTypeId that's a foreign key from PhoneType. The PhoneDirectory.PhoneTypeId column alone would take up 4MB of storage as an int but only 1MB as a tinyint. Not bad, huh?
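Here's a rough sketch of that example; the table layouts are invented, but the PhoneTypeId column shows the idea:

create table PhoneType (
    PhoneTypeId tinyint not null primary key,
    PhoneTypeName varchar(20) not null
)

create table PhoneDirectory (
    PhoneDirectoryId int identity not null primary key,
    PhoneNumber varchar(20) not null,
    PhoneTypeId tinyint not null references PhoneType  -- 1 byte per row instead of 4
)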

But disk space is cheap and it's measured in terabytes these days. One megabyte, four megabytes, who cares? While there's certainly a lot to be said for throwing more hardware (in this case, storage) at a problem, those savings can add up, and remember that a smaller database doesn't just mean less disk usage. It also means faster backups and restores, fewer replication errors, and smaller bandwidth usage for remote archiving and DR. Did I mention better performance? That's right, smaller rows mean you can fit more of them in each page of data and therefore reduce the number of reads and writes.

Next time you create a table, have a think about that and choose your data types accordingly. Or get started on refactoring your existing tables right now.

01 February 2010

Data Object Interfaces in Linq

I'm currently working on a database where a lot of tables have a common group of columns, like this:
created_on datetime not null,
created_by smallint not null,
updated_on datetime not null,
updated_by smallint not null
The smallints are foreign keys from the User table. Together they provide first-level auditability to answer questions like "When was each row created and by whom?" and "Which user made the last edit to this entity?". They're present in every table that has CRUD operations exposed in the application (that includes User itself). While the columns are always exactly the same, I prefer to keep them inline rather than move them to a separate table since I don't think they form an entity in their own right.

Anyway, there are places in the application where I just want to deal with those columns regardless of which table they came from. The footer that says "Created by user1 on ..." is a perfect example. The aim is to have all the classes that Visual Studio automatically generates implement an interface that encapsulates the columns above. Something like this:

public interface AuditInformation {
    DateTime created_on { get; }
    short created_by { get; }
    DateTime updated_on { get; }
    short updated_by { get; }
}
How can we achieve that? Remember that those classes are rewritten every time you use the design surface, so directly editing them is a clear no-no. Well, if all your tables define these columns in exactly the same way, the classes already contain all the right members - you just have to let the compiler know they implement the interface. (This is where being very consistent in your naming and choice of datatypes can pay off nicely.)

Instead of editing the classes themselves, we can extend them using partial definitions. Consider a Product table and class. We can add this code to a separate file:

public partial class Product : AuditInformation { }
Is that it? Actually yes! Like I said, if you created your columns correctly, the auto-generated class already implements the interface because it contains exactly the right properties. Now your CreatedByFooter user control (or helper method) can reference the interface without caring which class is really being used.
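As an illustration, a hypothetical helper along those lines (the class, method and format string are made up):

public static class AuditFooter {
    // Works for Product, or any other generated class that implements AuditInformation
    public static string CreatedByText(AuditInformation audit) {
        return string.Format("Created by user {0} on {1:d}",
            audit.created_by, audit.created_on);
    }
}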

For extra bonus points, notice that you aren't limited to the column values. You can also use the relationship properties generated from the foreign keys. In other words, the Product class probably contains a User property that corresponds to the created_by column (returning a User object in its full glory instead of a plain user id value). You can add that property to the interface too. The only tricky bit is that you have to manually fix the relationship names in Visual Studio when you import the tables. In the example above, by default, the created_by relationship is called User and the updated_by one User1 (or sometimes the other way around). That's not very intuitive, so rename the relationships to something sensible like CreatedBy and UpdatedBy and add those same names to the interface.
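Put together, the interface might end up looking like this (User being the designer-generated entity class, and CreatedBy/UpdatedBy the renamed relationship properties):

public interface AuditInformation {
    DateTime created_on { get; }
    short created_by { get; }
    DateTime updated_on { get; }
    short updated_by { get; }
    User CreatedBy { get; }
    User UpdatedBy { get; }
}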

That's it, job done!

25 January 2010

The 15 Minute Backup Strategy

For a long time our only backup strategy was to take a full backup of the database every night and try not to worry about it too much. Eventually the database grew to a point where the job that copied the backups to off-site storage couldn't keep up any more, not to mention that all that copying was consuming some serious bandwidth. Then some bright spark realized we couldn't really afford to lose a whole day of work if the server ever crashed and we had to restore the last backup. We obviously needed a better plan.

While any good DBA will tell you there are a million things to consider when deciding on your backup strategy, we needed a plan that was simple enough for a developer to implement and maintain. It also had to create small(er) backups and provide better coverage. Sound like too much to ask? Well, we came up with a pretty good solution.

To improve our coverage, we could take more frequent transaction log backups since they are small and fast even with a reasonably large database. The only problem is that log backups have to be restored sequentially on top of the correct full backup. If one of those files is corrupt, all the data from that file onwards is lost. So, we didn't want too many log backups.

On the other hand, we had to reduce the frequency of the full backups to ease the load on the bandwidth. That's when we turned to differential backups to fill the gap. Differential backups are cumulative, i.e., they include all changes that occurred since the last full backup. To restore, you only need to apply the last differential on top of the correct full backup. The downside is that, over time, a differential backup will grow towards the same size as the full one.

This led to our three-pronged approach. We decided to have:

  • full backups once a week
  • differential backups every 4 hours
  • transaction log backups every 15 minutes
In the event of a crash, we lose at most 15 minutes' worth of data. The backup set shrank from 7 full backups a week to 1 full backup, 1 differential backup and, at most, 15 transaction log backups (to cover the period between differentials). The restore procedure is a bit more complicated than just restoring a single full backup but not by much. There are up to 17 files to be restored and this can easily be automated by a stored procedure or script. Overall, it's a pretty good improvement over the previous strategy.
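For reference, the three backup types boil down to three variations of the same command; the database name and file paths below are placeholders, and the scheduling itself lives in SQL Agent jobs:

-- once a week
backup database MyDb to disk = 'D:\Backups\MyDb_full.bak'

-- every 4 hours
backup database MyDb to disk = 'D:\Backups\MyDb_diff.bak' with differential

-- every 15 minutes
backup log MyDb to disk = 'D:\Backups\MyDb_log.trn'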