22 February 2010

Class inheritance in relational databases

Here's my rather crude definition of class inheritance: the ability to create a base class encapsulating the common elements of a bunch of other classes while letting developers focus on a higher level of abstraction. If you write code in any OO language you probably use inheritance every so often. That's fine, it works great and we all understand it pretty well, but what happens if your objects have to be stored in a database? Well, maybe not your objects per se, but the data they store and represent. You probably use a relational database already, but it doesn't have any support for this so-called inheritance. Now what?

Actually, there is a well-known technique for implementing the same idea in relational databases. Joe Celko explains all the little details in his great book so, if you can, go check it out. Anyhow, this is how I would do it.

Let's use an example. A database for a bicycle inventory system stores each type of bicycle part in its own table: Frame, Wheel, Pedal, etc. That makes sense since each part requires entirely different columns. The system has to track if a frame is for men or women, if a pedal requires special shoes, and if a wheel has spokes or not. (By the way, an alternative would be to store everything in a very wide table with lots of nullable columns but that has its own problems and doesn't use inheritance.) However, the details of a sale (quantity, date, etc.) are the same for every part. Do we really need a Sale table that has nullable FrameId, WheelId and PedalId columns? There must be a better way.

First of all, we need a Part table - that will be our base. Next, every row in Frame, Wheel or Pedal has to have a corresponding row in Part. The best way to do that is to share ids. If we're going to do that, someone has to be in charge of issuing new ids, and the logical candidate is the base table. How about this:

create table Part (
    id int identity not null primary key
)

create table Frame (
    id int not null primary key references Part
    -- other Frame columns
)

create table Wheel (
    id int not null primary key references Part
    -- other Wheel columns
)

create table Pedal (
    id int not null primary key references Part
    -- other Pedal columns
)
Great, now Sale can reference Part without caring whether the part is a wheel or a frame. We have a base, we have some derived tables and every derived row is also a member of the base table. That looks like inheritance to me. Job done! Let's move on.
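For illustration only (the Sale table and its columns are my assumptions, not part of the original schema), the sale side could now look something like this:

```sql
create table Sale (
    id int identity not null primary key,
    partId int not null references Part,  -- any part: frame, wheel or pedal
    quantity int not null,
    saleDate datetime not null
)
```

A single partId column replaces the nullable FrameId, WheelId and PedalId columns we were trying to avoid.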

Hold on a second, on closer inspection things look a bit messy. Querying Sale and Part alone, we couldn't say whether a wheel was sold even if we wanted to. We would have to join to Wheel and see if a match was found or not. In other words, there's no way to tell what type a part is just by looking at the Part table. What we need is a part type:

create table PartType (
    typeId int not null primary key,
    typeName varchar(20) not null
)

insert PartType values (1, 'Frame')
insert PartType values (2, 'Wheel')
insert PartType values (3, 'Pedal')

alter table Part
    add typeId int not null references PartType
That's more like it. With this modification, the base table can assert that each row is actually of a specific derived type. If we add a row to Part like this:
insert Part(typeId) values (3)
    -- the row is assigned Part.id = 1
We know that the new row is a Pedal. Wait a minute, does that mean I can't add a row to Wheel with id 1? Actually, you can, and that would be weird. The Part table would say that part id 1 is a pedal but Wheel would say it's a wheel. Maybe it's a new hybrid: the PedalWheel, or the WheelPedal?

We are still missing the final piece of the puzzle that brings it all together. Adding a row to the wrong derived table should not be allowed. We can stop that from happening by changing the foreign keys and adding a new constraint. Let's drop the tables and start again:

drop table Frame
drop table Wheel
drop table Pedal
drop table Part
drop table PartType

create table PartType (
    typeId int not null primary key,
    typeName varchar(20) not null
)

insert PartType values (1, 'Frame')
insert PartType values (2, 'Wheel')
insert PartType values (3, 'Pedal')

create table Part (
    id int identity not null primary key,
    typeId int not null references PartType,
    constraint u_Part unique (id, typeId)
)

create table Frame (
    id int not null primary key,
    typeId int not null,
    -- other Frame columns
    constraint fk_Frame_Part foreign key (id, typeId)
    references Part (id, typeId),
    constraint ck_Frame_Type check (typeId = 1)
)

create table Wheel (
    id int not null primary key,
    typeId int not null,
    -- other Wheel columns
    constraint fk_Wheel_Part foreign key (id, typeId)
    references Part (id, typeId),
    constraint ck_Wheel_Type check (typeId = 2)
)

create table Pedal (
    id int not null primary key,
    typeId int not null,
    -- other Pedal columns
    constraint fk_Pedal_Part foreign key (id, typeId)
    references Part (id, typeId),
    constraint ck_Pedal_Type check (typeId = 3)
)
We've added the typeId to the derived tables. At first, that seems unnecessary because each table will always have the same value in that column; every row in Wheel will have typeId = 2. But the column has to be there so it can take part in the foreign key and be checked by the check constraint. Also note that, even though id alone is the primary key of Part, we need a unique constraint on the pair (id, typeId) so it can be used as an alternate key. (Trivia: the SQL spec defines constraints that can check values in other tables, which would eliminate typeId from the derived tables. Unfortunately, SQL Server doesn't support that feature.)
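To see the constraints at work, here's a hypothetical session (assuming an empty Part table, so the first insert gets id 1):

```sql
-- Issue a new id of type Pedal (typeId = 3);
-- the new row gets id = 1
insert Part (typeId) values (3)

-- Succeeds: Part agrees that part 1 is a Pedal
insert Pedal (id, typeId) values (1, 3)

-- Fails: ck_Wheel_Type requires typeId = 2
insert Wheel (id, typeId) values (1, 3)

-- Also fails: Part has no row (1, 2), so the
-- foreign key to Part is violated
insert Wheel (id, typeId) values (1, 2)
```

Either way, part 1 can only ever appear in the Pedal table.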

There you have it: the model is flexible and can deal with pretty much any single inheritance scenario. It's also robust and guards against most types of invalid data. The only outstanding problem is that it's possible to add a row to the base table without a corresponding row in the correct derived table. Unfortunately, that's not something the model itself can prevent (not that I'm aware of), but there are a few ways to deal with the issue. Personally, I would use stored procedures for the inserts, updates, and deletes. If T-SQL isn't your thing, try building the correct logic into the data access layer of your application instead.
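As a sketch of the stored procedure approach (the procedure name is made up, error handling and the real Wheel columns are omitted), the insert could create both rows in one transaction:

```sql
create procedure InsertWheel
    -- the real Wheel columns would be parameters here
as
begin
    set nocount on
    begin transaction
        -- issue the id and declare its type in one step
        insert Part (typeId) values (2)  -- 2 = Wheel

        -- add the derived row under the same id
        insert Wheel (id, typeId)
        values (scope_identity(), 2)
    commit transaction
end
```

Since the procedure is the only code path that writes to these tables, a Part row can never be left without its Wheel counterpart.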

Now go have a good look at your database model. Can you find any tables that could use this technique?

15 February 2010

Speed up your queries: convert dates into ids

This is a quick tip on query performance tuning. As always, don't forget to measure the performance against real data before applying any changes to a production environment.

Slow running queries, especially reports, often include a seemingly inoffensive range of dates in the where clause. Ultimately, the end-user wants to know "how many sales were completed last week" or "how many products were added in January". Take this simple example:
select *
from T1
where created_on between @startDate and @endDate
-- other criteria
This query needs an index on the created_on column to perform well. However, if the where clause also references other columns, the optimizer might decide to ignore that index and use a clustered index scan instead, mainly to avoid bookmark lookups. The end result is that the query has to read most, if not all, of the table to find candidate rows. This issue only gets worse if the query joins other tables using the primary key.
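Assuming SQL Server and a table shaped like T1, the supporting index would look something like this (the index name is arbitrary; if the clustered index is on id, that column travels with the index automatically):

```sql
create index ix_T1_created_on on T1 (created_on)
```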

As long as the dates are in the same order as the primary key and clustered index, we can replace the date range with a range of ids. Generally, a column that stores the date and time when the row was added is a good candidate for this technique. Let's try rewriting the query like this:

declare @startId int
select top 1 @startId = id
from T1
where created_on >= @startDate
order by created_on, id

declare @endId int
select top 1 @endId = id
from T1
where created_on <= @endDate
order by created_on desc, id desc

select *
from T1
where id between @startId and @endId
-- other criteria
It looks a lot more complicated but the basic idea is simple. First, we figure out the id of the first row in the date range. Then we do the same for the last row. These two queries will use the same index on created_on as the original query, but they will run very fast since each is a simple index seek. Finally, the main query uses the same logic as before except with start and end ids instead of dates. Now the clustered index can seek straight to the correct range of rows instead of scanning most of the table.

That's it! The query should now run considerably faster with only a slight increase in complexity.

08 February 2010

The table with a tiny key

Whenever I model a new database table I like to start with an auto-incrementing primary key column. Maybe it's the Ruby on Rails convention creeping up on me, but I've even adopted the minimalist id as my default name for such columns. A lot of times I go ahead and make it an int. Sounds pretty good, huh? After all, the good old int is the de facto standard when it comes to integer data types. What else would you use, right?

Since you ask, there are a few alternatives out there: smallints, tinyints and bigints. Let's start with bigints. I don't think the name does them justice. Bigints aren't just big - they are massive. Massive in the range they can represent but also pretty hefty in the storage space they require: 8 bytes instead of the usual 4 an int takes. How can an additional 4 bytes be a problem? We'll come to that in a second but for now, just ask yourself if your table will eventually have more than 2 billion rows. In most cases, that's a very reasonable upper limit. For example, if you started with an empty table and added a new row every second, it would still take you approximately 68 years to reach that limit.

OK, so you probably don't need a bigint unless you're Google, Flickr or maybe Walmart. What about the other data types? Smallints and tinyints can't represent the large range of numbers that ints can. To refresh your memory, smallints go up to a little over 32,000 and tinyints a mere 255. And that's precisely where things get interesting. You see, the truth is, any database has a bunch of tables that have very few rows each. They might be domain tables like MaritalStatus ("Single", "Married", "Divorced", etc.) or PhoneType ("Work", "Home" or "Cell/Mobile"). Other tables might be limited by the nature of the data they store. For example, a table called WorldAirports is probably going to have fewer than 50,000 rows. My point is: why use an int as a unique identifier for each row if you're never going to have that many rows? Why not use a smallint and save 2 bytes per row? Even better, use a tinyint and chop 3 bytes off every row. If that doesn't seem like a lot, try looking at it this way: a tinyint requires 75% less space than an int.

Now, wait a minute! What's the point of optimizing a table if it only has a few rows anyway? The total savings are going to be negligible! Well, not quite... The magic of these small tables is that their primary keys get used as foreign keys, with the same data type, in much larger tables. Let's say you have a PhoneDirectory table with one million phone numbers and each phone number has a PhoneTypeId that's a foreign key from PhoneType. The PhoneDirectory.PhoneTypeId column alone would take up 4MB of storage as an int but only 1MB as a tinyint. Not bad, huh?
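The arithmetic behind that claim is easy to check. Here's a quick back-of-the-envelope sketch (it ignores row overhead, padding and indexes, and just counts the column values themselves):

```python
rows = 1_000_000

# Bytes used by the PhoneTypeId column alone
int_bytes = rows * 4      # int: 4 bytes per value
tinyint_bytes = rows * 1  # tinyint: 1 byte per value

print(int_bytes // 1_000_000, "MB as int")          # 4 MB
print(tinyint_bytes // 1_000_000, "MB as tinyint")  # 1 MB
print(f"{1 - tinyint_bytes / int_bytes:.0%} saved")  # 75% saved
```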

But disk space is cheap and it's measured in terabytes these days. One megabyte, four megabytes, who cares? While there's certainly a lot to be said for throwing more hardware (in this case, storage) at a problem, those savings can add up, and remember that a smaller database doesn't just mean less disk usage. It also means faster backups and restores, fewer replication errors, and less bandwidth usage for remote archiving and DR. Did I mention better performance? That's right, smaller rows mean you can fit more of them in each page of data and therefore reduce the number of reads and writes.

Next time you create a table, have a think about that and choose your data types accordingly. Or get started on refactoring your existing tables right now.

01 February 2010

Data Object Interfaces in Linq

I'm currently working on a database where a lot of tables have a common group of columns, like this:
created_on datetime not null,
created_by smallint not null,
updated_on datetime not null,
updated_by smallint not null
The smallints are foreign keys from the User table. Together they provide first level auditability to answer questions like "When was each row created and by whom?" and "Which user made the last edit to this entity?". They're present in every table that has CRUD operations exposed in the application (that includes User itself). While the columns are always exactly the same, I prefer to keep them inline rather than make a separate table since I don't think they form an entity in their own right.

Anyway, there are places in the application where I just want to deal with those columns regardless of which table they came from. The footer that says "Created by user1 on ..." is a perfect example. The aim is to have all the classes that Visual Studio automatically generates implement an interface that encapsulates the columns above. Something like this:

public interface AuditInformation {
    DateTime created_on { get; }
    short created_by { get; }
    DateTime updated_on { get; }
    short updated_by { get; }
}
How can we achieve that? Remember that those classes are re-written every time you use the design surface, so directly editing them is a clear no-no. Well, if all your tables define these columns exactly the same way, the classes already implement the interface - you just have to let the compiler know that. (This is where being very consistent in your naming and choice of data types can pay off nicely.)

Instead of editing the classes themselves, we can extend them using partial definitions. Consider a Product table and class. We can add this code to a separate file:

public partial class Product : AuditInformation { }
Is that it? Actually yes! Like I said, if you created your columns correctly, the auto-generated class already implements the interface because it contains exactly the right properties. Now your CreatedByFooter user control (or helper method) can reference the interface without caring which class is really being used.
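For example, a footer helper might be written against the interface alone (the method and its format string are my sketch, not from the original application):

```csharp
// Accepts a Product, a User or any other generated class
// that implements AuditInformation.
public static string CreatedByFooter(AuditInformation audit)
{
    return string.Format("Created by user {0} on {1:d}",
        audit.created_by, audit.created_on);
}
```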

For extra bonus points notice that you aren't limited to the column values. You can also use the relationship properties generated from the foreign keys. In other words, the Product class probably contains a User property that corresponds to the created_by column (returning a User object in its full glory instead of a plain user id value). You can add that property to the interface too. The only tricky bit is that you have to manually fix the relationship names in Visual Studio when you import the tables. In the example above, by default, the created_by relationship is called User and the updated_by User1 (sometimes, the other way around). That's not very intuitive so rename the relationships to something sensible like CreatedBy and UpdatedBy and add those same names to the interface.
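Putting it together, the extended interface might look like this (assuming the designer generated a User class for the User table, and that you've renamed the relationships as described):

```csharp
public interface AuditInformation {
    DateTime created_on { get; }
    short created_by { get; }
    DateTime updated_on { get; }
    short updated_by { get; }

    // Relationship properties, renamed from the designer's
    // default User/User1 to something sensible
    User CreatedBy { get; }
    User UpdatedBy { get; }
}
```

Now the footer can show the user's name, not just an id, while still not caring which table the row came from.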

That's it, job done!